Barcelona's Digital Landscape: a data-driven exploration of urban dynamics around Sagrada Familia. AI-generated by our team.


1. Introduction

1.1 Presentation of the case

A tourism company based in Zürich, Switzerland, has observed a significant increase in travel demand to Barcelona in recent years. Indeed, Barcelona ranks as the third most in-demand city for Airbnb rentals in Europe, behind Paris and London.

Consequently, the company’s manager has requested a Machine Learning study and analysis of Airbnb accommodations in the city. The goal is to understand price behavior and identify the factors influencing accommodation costs and occupancy, enabling the company to provide optimal responses to clients’ inquiries.

To achieve this goal, the team has decided to analyse and address three questions to provide comprehensive insights for the manager.

  1. What are the key factors influencing accommodation prices in Barcelona?
  2. Can we predict occupancy rates based on location, amenities, or other factors?
  3. How accurately can machine learning predict Airbnb prices in Barcelona? Which models perform best for this dataset?

1.2 Motivations

Barcelona is one of the most visited cities in Europe, and the rise of Airbnb and other short-term rental platforms has led to a notable increase in tourism. However, this growth also presents challenges for accommodation businesses and the local housing market. A study published on the Social Science Research Network (SSRN, link: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3428237) found that rental costs in neighborhoods with high Airbnb activity increased by 7% between 2009 and 2016. This is primarily because property owners, motivated by tourist demand for short-term rentals, frequently opt to let their properties short term at higher rates rather than commit to long-term leases.

For these reasons, it is crucial for tourism companies to understand this dynamic market to remain competitive and provide tailored services to their clients.

The Zürich-based tourism company needs reliable data on Airbnb prices and occupancy rates to make data-driven recommendations and stay ahead of competitors.

1.3 Disclaimer

This analysis is for educational purposes only. The findings are based on public data and are not professional advice. The results should not be used for business or policy decisions.

1.4 Dataset selected

To conduct the study, the team has decided to analyse a dataset of Barcelona Airbnbs available on the Kaggle website (link: https://www.kaggle.com/datasets/fermatsavant/airbnb-dataset-of-barcelona-city).

The dataset consists of 19,833 observations across 25 variables, including geographical zones, amenities, prices, and accommodations.

Below, we can see the structure of the dataset, and the names and data types for each column.

## Rows: 19,833
## Columns: 25
## $ X                     <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
## $ id                    <int> 18666, 18674, 21605, 23197, 25786, 31377, 31380,…
## $ host_id               <int> 71615, 71615, 82522, 90417, 108310, 134698, 1346…
## $ host_is_superhost     <chr> "f", "f", "f", "t", "t", "f", "f", "f", "f", "f"…
## $ host_listings_count   <int> 45, 45, 2, 5, 1, 9, 9, 41, 41, 1, NA, 3, 3, 4, 4…
## $ neighbourhood         <chr> "Sant Martí", "La Sagrada Família", "Sant Martí"…
## $ zipcode               <chr> "8026", "8025", "8018", "8930", "8012", "8025", …
## $ latitude              <dbl> 41.40889, 41.40420, 41.40560, 41.41203, 41.40145…
## $ longitude             <dbl> 2.18555, 2.17306, 2.19821, 2.22114, 2.15645, 2.1…
## $ property_type         <chr> "Apartment", "Apartment", "Apartment", "Apartmen…
## $ room_type             <chr> "Entire home/apt", "Entire home/apt", "Private r…
## $ accommodates          <int> 6, 8, 2, 6, 2, 2, 3, 4, 5, 1, 6, 2, 8, 2, 1, 1, …
## $ bathrooms             <dbl> 1.0, 2.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0…
## $ bedrooms              <int> 2, 3, 1, 3, 1, 1, 1, 1, 3, 1, 2, 1, 4, 1, 1, 1, …
## $ beds                  <int> 4, 6, 1, 8, 1, 2, 2, 1, 3, 1, 7, 1, 6, 1, 1, 1, …
## $ amenities             <chr> "['TV', 'Internet', 'Wifi', 'Air conditioning', …
## $ price                 <chr> "$130.00", "$60.00", "$33.00", "$210.00", "$45.0…
## $ minimum_nights        <int> 3, 1, 2, 3, 1, 3, 3, 1, 1, 29, 2, 4, 5, 2, 2, 2,…
## $ has_availability      <chr> "t", "t", "t", "t", "t", "t", "t", "t", "t", "t"…
## $ availability_30       <int> 0, 3, 4, 11, 8, 5, 3, 2, 3, 4, 2, 25, 9, 1, 3, 2…
## $ availability_60       <int> 0, 20, 8, 33, 19, 8, 8, 17, 19, 4, 16, 55, 31, 3…
## $ availability_90       <int> 0, 50, 15, 63, 41, 16, 15, 29, 31, 4, 42, 80, 61…
## $ availability_365      <int> 182, 129, 15, 318, 115, 211, 211, 266, 257, 26, …
## $ number_of_reviews_ltm <int> 0, 10, 36, 16, 49, 0, 2, 34, 15, 0, 10, 0, 24, 6…
## $ review_scores_rating  <int> 80, 87, 90, 95, 95, 95, 87, 92, 88, 99, 87, 68, …

Numerical values:

  • X: numerical index for rows
  • id: unique identifier for listings
  • host_id: unique identifier for hosts
  • host_listings_count: number of listings by the host
  • latitude: geographic latitude of the listing
  • longitude: geographic longitude of the listing
  • accommodates: number of guests the listing can accommodate
  • bathrooms: number of bathrooms in the listing
  • bedrooms: number of bedrooms in the listing
  • beds: number of beds in the listing
  • minimum_nights: minimum number of nights required for booking
  • availability_30: number of available nights in the next 30 days
  • availability_60: number of available nights in the next 60 days
  • availability_90: number of available nights in the next 90 days
  • availability_365: number of available nights in the next 365 days
  • number_of_reviews_ltm: number of reviews in the last 12 months
  • review_scores_rating: average review rating score

Binary variables:

  • host_is_superhost: indicates if the host is a superhost ("t" or "f")
  • has_availability: indicates if the listing is available for booking ("t" or "f")

String values:

  • neighbourhood: name of the neighbourhood where the listing is located
  • zipcode: postal code of the listing
  • property_type: type of property (e.g., “Apartment”)
  • room_type: type of room (e.g., “Entire home/apt”)
  • amenities: list of amenities provided in the listing
  • price: price of the listing as a string (e.g., “$130.00”)

As shown above, the variable ‘price’ has a character data type. Therefore, in chapter 3, this field will be converted into a numeric variable to enable the necessary calculations.

1.5 Sub-sampling

To streamline the calculations and analysis, a sub-dataset of 10,000 randomly selected observations is created in the following steps. A seed is set so that the same observations are drawn on every run and retained throughout the analysis.
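
The sub-sampling step could be sketched as follows. This is a minimal illustration, not the report's actual code: the data frame name `BCN_Accomm_full`, the toy data, and the seed value are assumptions, while `BCN_Accomm_sub` is the name used in the later chapters.

```r
# Toy stand-in for the full dataset (hypothetical; the real data has 19,833 rows)
BCN_Accomm_full <- data.frame(id = 1:19833, price = rep("$50.00", 19833))

# Assumed seed value; any fixed integer makes the sample reproducible
set.seed(123)

# Draw 10,000 row indices without replacement and keep those rows
sample_rows <- sample(nrow(BCN_Accomm_full), size = 10000)
BCN_Accomm_sub <- BCN_Accomm_full[sample_rows, ]

nrow(BCN_Accomm_sub)
```

Re-running the block with the same seed returns the same 10,000 rows, which keeps every later result comparable across runs.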

2. Methodology

To address the research questions, the study will be divided into three parts. First, an Exploratory Data Analysis (EDA) will be conducted to gain a deeper understanding of the data. Second, Machine Learning models will be implemented, and their performance will be evaluated to identify the best-performing model. Finally, the selected model will be used to provide the most accurate answers to the research questions posed by the team.

The different models to be developed are:

  1. Linear Models
  2. Generalised Linear Models with family set to Poisson for count data
  3. Generalised Linear Models for multinomial data
  4. Generalised Additive Models
  5. Neural Network
  6. Support Vector Machine

3. Exploratory Data Analysis

3.1 Converting Price variable to numeric

First, the pricing variable will be converted into a numeric format, and in the fifth chapter of this report (Machine Learning Models), the categorical variables will be transformed into factors for further analysis and modeling.

BCN_Accomm_sub$price <- gsub(",", "", BCN_Accomm_sub$price) # removed ','
BCN_Accomm_sub$price <- gsub("\\$", "", BCN_Accomm_sub$price) # removed '$' sign
BCN_Accomm_sub$price <- as.numeric(BCN_Accomm_sub$price)  # converted to number format

3.2 Missing values (detect and treat)

We inspect the dataset to get an idea of the missing values it may contain, their number, and their distribution.

Missing Values by Variable
Variable                Missing_Count   Missing_Percent
review_scores_rating    2415            24.15%
host_listings_count     22              0.22%
beds                    18              0.18%
bathrooms               6               0.06%
bedrooms                2               0.02%
X                       0               0%
id                      0               0%
host_id                 0               0%
host_is_superhost       0               0%
neighbourhood           0               0%
zipcode                 0               0%
latitude                0               0%
longitude               0               0%
property_type           0               0%
room_type               0               0%
accommodates            0               0%
amenities               0               0%
price                   0               0%
minimum_nights          0               0%
has_availability        0               0%
availability_30         0               0%
availability_60         0               0%
availability_90         0               0%
availability_365        0               0%
number_of_reviews_ltm   0               0%

3.2.1 How to manage Missing values for each field (MD)

  • host_listings_count: since the number of listings per host cannot be estimated from the other fields, the 22 rows that lack it are excluded.
  • bathrooms: the number of bathrooms is missing in 6 rows.
  • beds: the number of beds is not specified for 18 listings.
  • bedrooms: 2 rows contain missing values and can be deleted.
  • review_scores_rating: the review score is missing in 2,415 of the 10,000 rows, about 24% of the selected data, which is a substantial share. In this case, we impute the missing values by replacing them with 0.

Finally, the other fields with missing values do not lend themselves to imputation, that is, replacing the empty data with estimates (such as the mean). Together these fields account for 48 missing values, at most 0.48% of the listings, so the affected rows can be deleted without losing much information.
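
The inspection and treatment described above could be sketched as follows. The toy data frame `df` and its values are hypothetical stand-ins for the sub-dataset; only the column names come from the report.

```r
# Toy data frame standing in for the sub-dataset (hypothetical values)
df <- data.frame(
  beds = c(1, NA, 2, 3),
  bathrooms = c(1, 1, NA, 2),
  review_scores_rating = c(90, NA, NA, 80)
)

# Count and percentage of missing values per variable, largest first
missing_count <- sort(colSums(is.na(df)), decreasing = TRUE)
missing_summary <- data.frame(
  Missing_Count = missing_count,
  Missing_Percent = round(100 * missing_count / nrow(df), 2)
)
print(missing_summary)

# Replace missing review scores with 0, as decided in the report
df$review_scores_rating[is.na(df$review_scores_rating)] <- 0

# Drop the rows that still contain a missing value in any other field
df_clean <- na.omit(df)
```

Imputing before calling `na.omit()` matters here: otherwise every row with a missing review score would be dropped as well.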

3.3 Correlation matrix

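# Load the packages used below (dplyr for select_if and %>%, corrplot for the plot)
library(dplyr)
library(corrplot)

# Create a correlation matrix for numeric fields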
cor_BNC_Accomm <- select_if(BCN_Accomm, is.numeric) %>%
  select(-c(id, X, host_id))

# make a data frame
cor_BNC_Accomm <- data.frame(cor_BNC_Accomm)
str(cor_BNC_Accomm)
## 'data.frame':    9953 obs. of  15 variables:
##  $ host_listings_count  : int  9 6 39 6 109 1 0 16 32 2 ...
##  $ latitude             : num  41.4 41.4 41.4 41.4 41.4 ...
##  $ longitude            : num  2.17 2.2 2.17 2.17 2.14 ...
##  $ accommodates         : int  5 5 6 16 4 4 1 2 2 4 ...
##  $ bathrooms            : num  1 1 1 6 1 1 1 1 1 1.5 ...
##  $ bedrooms             : int  2 2 2 7 2 1 1 0 1 1 ...
##  $ beds                 : int  5 5 5 13 2 2 1 1 2 2 ...
##  $ price                : num  105 25 85 899 83 45 29 50 165 70 ...
##  $ minimum_nights       : int  32 1 1 2 1 2 1 1 3 1 ...
##  $ availability_30      : int  9 18 21 3 2 0 10 17 19 17 ...
##  $ availability_60      : int  39 48 51 24 16 0 30 47 23 47 ...
##  $ availability_90      : int  40 78 71 47 26 0 60 77 42 59 ...
##  $ availability_365     : int  40 353 327 241 297 0 335 352 127 64 ...
##  $ number_of_reviews_ltm: int  0 12 2 7 0 0 2 4 9 21 ...
##  $ review_scores_rating : num  0 83 90 92 0 96 100 90 86 97 ...
# print correlation matrix
corrplot(cor(cor_BNC_Accomm), type = "upper", order = "hclust", tl.col = "black")

From the correlation matrix, the following characteristics can be deduced:

  • There is almost no correlation between the availability periods and the number of beds, bedrooms, and bathrooms. This suggests that the availability of a listing does not depend on those features, but more probably on its location and facilities.
  • There is a positive correlation between the number of bedrooms, beds, and bathrooms.
  • There is a positive correlation among the availability periods.

3.4 Outliers (boxplots) (detect and treat)

Since the price variable is a key focus of our analysis, an outlier analysis of this variable has been conducted.

3.4.1 Count of outliers in Price

Out of a total of 10,000 values, 804 (8.04%) are identified as outliers. Below, a boxplot is presented to visualize the median and the outlier observations.

From the boxplot above, it can be concluded that the median price is approximately 65€ per night, with 50% of the observations concentrated between 40€ (25th percentile) and 112€ (75th percentile), representing the interquartile range (IQR).

Additionally, the presence of numerous outliers extending to the right indicates a right-skewed distribution: a long tail of high prices pulls the mean above the median.

The significant number of observations with higher prices could suggest the presence of many luxury properties. Therefore, further analysis is required to identify the factors influencing these price variations.
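
The outlier count could be reproduced along the following lines. This sketch assumes the usual boxplot rule (values more than 1.5 × IQR beyond the quartiles), which the report does not state explicitly; the price vector is hypothetical.

```r
# Hypothetical prices; the report applies this to the price column of the sub-dataset
price <- c(25, 40, 45, 50, 65, 70, 85, 110, 165, 899)

# Quartiles and interquartile range
q <- quantile(price, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]

# Fences of the assumed 1.5 * IQR rule
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Flag, count, and express outliers as a share of all observations
is_outlier <- price < lower | price > upper
n_outliers <- sum(is_outlier)
pct_outliers <- round(100 * mean(is_outlier), 2)
```

Note that `boxplot()` computes its whiskers from hinges rather than from `quantile()`, so counts taken from the plot can differ slightly from this calculation.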

3.4.2 Detecting Top 20 Price Outliers

Table 2: Top 20 and Outliers for Price Variable with Neighbourhood
neighbourhood                    property_type       bedrooms   price
Gràcia                           Bed and breakfast   1          8000
Sants-Montjuïc                   Boat                4          8000
Vila de Gràcia                   Bed and breakfast   1          8000
Vila de Gràcia                   Bed and breakfast   1          8000
Vila de Gràcia                   Bed and breakfast   1          8000
Vila de Gràcia                   Bed and breakfast   1          8000
Eixample                         Boutique hotel      1          6000
Eixample                         Hotel               1          6000
Eixample                         Hotel               1          6000
Eixample                         Hotel               1          6000
Eixample                         Hotel               1          6000
Eixample                         Hotel               1          6000
La Nova Esquerra de l’Eixample   Hotel               1          6000
La Nova Esquerra de l’Eixample   Hotel               1          6000
La Nova Esquerra de l’Eixample   Hotel               1          6000
La Nova Esquerra de l’Eixample   Hotel               1          6000
Sant Antoni                      Boutique hotel      1          6000
Sant Antoni                      Boutique hotel      1          6000
Sant Antoni                      Hotel               1          6000
Sant Antoni                      Hotel               1          6000

The price variable appears to contain erroneous entries. As a reference point, the average nightly rate for an Airbnb in Barcelona is €93 (according to Hostel Geeks, link: https://hostelgeeks.com/best-airbnbs-in-barcelona-spain/). Prices of €8,000 are therefore likely errors, and prices above €1,000 were excluded from the analysis.
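
A minimal sketch of this inspection and filter, using a hypothetical `listings` data frame in place of the report's data:

```r
# Hypothetical listings; column names follow the report's dataset
listings <- data.frame(
  neighbourhood = c("Eixample", "Sant Antoni", "Eixample", "Sant Antoni"),
  price = c(105, 8000, 6000, 1000)
)

# Inspect the most expensive listings first (the report shows the top 20)
top_prices <- head(listings[order(-listings$price), ], 20)

# Keep listings priced at or below the 1,000 threshold
listings_filtered <- listings[listings$price <= 1000, ]
summary(listings_filtered$price)
```

The comparison uses `<=` so that a listing priced at exactly 1,000 is kept, which is consistent with a maximum of 1000.00 after filtering.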

The new summary for the price variable is the following:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   40.00   65.00   96.93  110.00 1000.00

Now, a new Correlation Matrix with the filtered observations of the price variable is displayed.

## 'data.frame':    9953 obs. of  15 variables:
##  $ host_listings_count  : int  9 6 39 6 109 1 0 16 32 2 ...
##  $ latitude             : num  41.4 41.4 41.4 41.4 41.4 ...
##  $ longitude            : num  2.17 2.2 2.17 2.17 2.14 ...
##  $ accommodates         : int  5 5 6 16 4 4 1 2 2 4 ...
##  $ bathrooms            : num  1 1 1 6 1 1 1 1 1 1.5 ...
##  $ bedrooms             : int  2 2 2 7 2 1 1 0 1 1 ...
##  $ beds                 : int  5 5 5 13 2 2 1 1 2 2 ...
##  $ price                : num  105 25 85 899 83 45 29 50 165 70 ...
##  $ minimum_nights       : int  32 1 1 2 1 2 1 1 3 1 ...
##  $ availability_30      : int  9 18 21 3 2 0 10 17 19 17 ...
##  $ availability_60      : int  39 48 51 24 16 0 30 47 23 47 ...
##  $ availability_90      : int  40 78 71 47 26 0 60 77 42 59 ...
##  $ availability_365     : int  40 353 327 241 297 0 335 352 127 64 ...
##  $ number_of_reviews_ltm: int  0 12 2 7 0 0 2 4 9 21 ...
##  $ review_scores_rating : num  0 83 90 92 0 96 100 90 86 97 ...

We can observe that the price variable is now most correlated with variables related to the size and capacity of an Airbnb, such as bathrooms, bedrooms, number of beds, and accommodates.

On the other hand, general availability and review scores are only weakly correlated with the price.
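
Reading such rankings off a plot can be error-prone, so the correlations with price could also be listed numerically. The sketch below uses simulated data (an assumption, not the report's values) in which price depends on capacity but not on availability.

```r
# Simulated numeric data; the report uses the numeric columns of its dataset
set.seed(1)
n <- 200
accommodates <- sample(1:8, n, replace = TRUE)
toy <- data.frame(
  accommodates = accommodates,
  bedrooms = pmax(1, round(accommodates / 2)),
  availability_365 = sample(0:365, n, replace = TRUE),
  price = 20 * accommodates + rnorm(n, sd = 10)
)

# Correlation of every variable with price, strongest first
cor_with_price <- cor(toy)[, "price"]
cor_with_price <- sort(cor_with_price[names(cor_with_price) != "price"],
                       decreasing = TRUE)
print(round(cor_with_price, 2))
```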

3.5 Variables inspection

3.5.1 Histograms for Numerical Variables

From the histograms above, several variables exhibit right-skewed distributions, including price, minimum_nights, and number_of_reviews_ltm.

On the other hand, the data suggests that in Barcelona, Airbnb listings are primarily designed for small groups of people seeking short-term stays. Additionally, these accommodations tend to receive high review scores, indicating good guest satisfaction with the different properties.

3.5.2 Pie Charts for Binary Variable “Host_is_superhost”

Almost 19% of the hosts offering an Airbnb in Barcelona are categorized as Superhosts. This means tourists can find accommodations in the city where hosts go above and beyond to provide excellent hospitality. This insight could be a key factor in explaining the higher price values observed in certain neighborhoods.

3.5.3 Plots for String Variables

This section presents a variety of plots for the categorical variables, including Property Type, Room Type, Top Neighbourhoods, and Amenities.

From the plot above, it can be observed that apartments dominate the Airbnb market in Barcelona, accounting for 86% of the listings.

On the other hand, the low share of luxury or specialized accommodations, such as Boutique Hotels (0.5%), Guest Suites (0.7%), and Lofts (2.4%), suggests that these property types cater to a niche market. Travelers opting for these accommodations are likely visiting Barcelona for specific reasons, such as work or unique travel experiences.

Most listings are split between the room types Entire home/Apartment and Private Room.

Less than 1% of the hosts offer a Shared Room, which suggests that travelers prefer more privacy during their stay.

The Eixample district of Barcelona represents the most popular neighbourhood on Airbnb, with 27% of the total listings. This is followed by Ciutat Vella, which accounts for 18.8% of the listings.

Eixample is close to the historic centre of the city and more centrally located than other neighbourhoods. The area offers many attractions for tourists, including La Sagrada Familia, Casa Batlló, and Passeig de Gràcia. Together with its excellent transport connections, this makes Eixample an ideal destination for visitors.

On the other hand, Ciutat Vella is the oldest part of Barcelona and serves as the heart of the city, known for its historical charm and vibrant cultural scene.

Given this, the Tourism Company in Zürich could recommend that its clients focus on these neighbourhoods to attract more customers and enhance their travel experience.

3.5.4 Wordcloud for Amenities

The wordcloud above shows the most common amenities offered by the different hosts.

The most prominent amenities are: Kitchen, Wifi, Heating, Washer and Hair dryer. This can be taken to indicate that tourists may consider a place to be comfortable for their stay if it meets these basic requirements.
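
The frequencies behind such a wordcloud could be obtained by parsing the amenities strings, as sketched below. The example strings are hypothetical but follow the format of the dataset's amenities column; the parsing approach is an assumption, since the report does not show how the wordcloud was built.

```r
# Hypothetical amenities strings in the same format as the dataset column
amenities <- c(
  "['TV', 'Internet', 'Wifi', 'Kitchen']",
  "['Wifi', 'Kitchen', 'Heating']",
  "['Wifi', 'Washer']"
)

# Strip the surrounding brackets and the quotes, then split on commas
cleaned <- gsub("\\[|\\]|'", "", amenities)
amenity_list <- strsplit(cleaned, ",\\s*")

# Count how often each amenity appears across all listings
amenity_freq <- sort(table(unlist(amenity_list)), decreasing = TRUE)
print(amenity_freq)
```

This simple split would mis-handle amenity names that themselves contain a comma or an apostrophe, so the result should be checked before it is passed to a word-cloud function.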

3.6 Summary Statistics

4. Visualization Insights

This section was designed to allow the employees of the Tourism Company and other Users to interact with the data on neighborhoods, prices, and reviews.

4.1 Airbnb locations by neighbourhood with Interactive panel

The purpose of the following interactive plot is to allow users to select a neighborhood of interest and visualize, on a map, the different accommodations available along with their price per night when one of the circles is clicked.

4.2 Heatmap of prices

In the heatmap below, users can observe the zones with higher accommodation prices (red/orange areas).

In contrast, the zones colored in green or blue represent lower-priced neighborhoods.

According to the heatmap, the Tourism Company can recommend the red zones to tourists looking for more centralized accommodations, regardless of price. On the other hand, tourists who want to save money can be advised to choose accommodations in the green or blue areas, which are typically farther from the city center.

4.3 Review Score Rating vs. Price by Room Type

From the plot above, the following insights can be derived:

  • The majority of listings are concentrated at the lower price range (below 250 Euros), irrespective of room type.
  • Accommodations with high review scores (exceeding 90 points) are distributed across all price categories, indicating that well-reviewed Airbnbs are not restricted to a particular room type or price range.

5. Machine Learning Models

5.1 Variables treatment

In this chapter, different machine learning models will be explored to predict Airbnb prices and the occupancy rate over the next 30 days.

The formula to calculate the Occupancy rate in 30 days is:

\[ \text{Occupancy Rate} = \left(1 - \frac{\text{availability\_30}}{30}\right) \times 100 \]

According to the formula above, a new column with the occupancy rate is calculated. The first values of the new variable occupancy_rate_30 are:

## [1]  70.00000  40.00000  30.00000  90.00000  93.33333 100.00000
## [1] "numeric"

This new variable is the target used to predict the occupancy rate over the next 30 days.
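
As a worked example of the formula, the sketch below applies it to the first six availability_30 values shown in the str() output of chapter 3. It assumes the calculation is done while availability_30 is still numeric, that is, before any conversion to a factor.

```r
# First six availability_30 values from the str() output in chapter 3
availability_30 <- c(9, 18, 21, 3, 2, 0)

# Occupancy rate (%) = share of the next 30 days that is not available
occupancy_rate_30 <- (1 - availability_30 / 30) * 100
print(occupancy_rate_30)
```

A listing with 9 available nights is booked for 21 of 30 nights, which gives 70%, in line with the first value printed above.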

As mentioned in the previous chapters, the categorical variables are converted into factors to proceed with the modeling phase. These are: neighbourhood, property_type, room_type, availability_30, zipcode. Moreover, variables that represent counts or continuous numbers are converted into numeric: accommodates, bathrooms, bedrooms, beds, latitude, longitude, review_scores_rating and minimum_nights.
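
The type conversions could be sketched as follows. The toy data frame and the shortened column lists are illustrative assumptions; the report applies the same idea to the full lists named above.

```r
# Hypothetical rows; column names follow the report's dataset
df <- data.frame(
  neighbourhood = c("Eixample", "Eixample", "Sant Antoni"),
  room_type = c("Entire home/apt", "Private room", "Private room"),
  accommodates = c(5L, 2L, 4L),
  stringsAsFactors = FALSE
)

# Columns to treat as categorical and as numeric (shortened for illustration)
factor_cols <- c("neighbourhood", "room_type")
numeric_cols <- c("accommodates")

# Convert each group of columns in one step
df[factor_cols] <- lapply(df[factor_cols], as.factor)
df[numeric_cols] <- lapply(df[numeric_cols], as.numeric)
str(df)
```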

5.2 Train and Test data

Before analysing the different models, we divide the data into a training set and a test set. The training set is used to estimate the relationship between the dependent and independent variables, while the test set is used to assess the performance of the models. We use about 60% of the dataset for training and the rest for testing. We also remove rows with NA values from the test set to avoid problems when evaluating the models’ predictions.

set.seed(1000)
# Define the number of groups and the amount of sample for each
group <- sample(2, nrow(BCN_Accomm),
                replace = TRUE,
                prob = c(0.6, 0.4))

# training data set with around 60% of the samples
train <- BCN_Accomm[group==1,]

# test data set with around 40% of the samples
test <- BCN_Accomm[group==2,]
test <- na.omit(test) # removing rows with NA values might reduce the size of the test set

Moreover, we ensure that all factor variables have the same levels in the train and test sets, and run a final check that the numerical variables contain no NA values.

## 'data.frame':    9875 obs. of  26 variables:
##  $ X                    : int  16886 3429 3695 3051 11158 8191 18373 17206 11077 13409 ...
##  $ id                   : int  34191752 6787210 7555948 5767967 24448300 19424985 35589671 34523402 24344024 29030221 ...
##  $ host_id              : int  224372816 15681396 3911721 2151490 163379623 96917715 11951658 4006184 9784103 218785807 ...
##  $ host_is_superhost    : chr  "t" "f" "f" "f" ...
##  $ host_listings_count  : int  9 6 39 6 109 1 0 16 32 2 ...
##  $ neighbourhood        : Factor w/ 66 levels "","Camp d'en Grassot i Gràcia Nova",..: 5 39 2 22 58 27 55 58 8 27 ...
##  $ zipcode              : Factor w/ 53 levels "8000","8001",..: 2 21 26 2 15 41 27 5 10 33 ...
##  $ latitude             : num  41.4 41.4 41.4 41.4 41.4 ...
##  $ longitude            : num  2.17 2.2 2.17 2.17 2.14 ...
##  $ property_type        : Factor w/ 26 levels "Aparthotel","Apartment",..: 2 2 2 2 2 2 2 21 2 11 ...
##  $ room_type            : Factor w/ 3 levels "Entire home/apt",..: 1 2 1 1 1 2 2 1 1 2 ...
##  $ accommodates         : int  5 5 6 16 4 4 1 2 2 4 ...
##  $ bathrooms            : num  1 1 1 6 1 1 1 1 1 1.5 ...
##  $ bedrooms             : int  2 2 2 7 2 1 1 0 1 1 ...
##  $ beds                 : int  5 5 5 13 2 2 1 1 2 2 ...
##  $ amenities            : chr  "['TV', 'Cable TV', 'Wifi', 'Air conditioning', 'Kitchen', 'Elevator', 'Heating', 'Washer', 'Essentials', 'Hange"| __truncated__ "['Internet', 'Wifi', 'Air conditioning', 'Elevator', 'Free street parking', 'Buzzer/wireless intercom', 'Heatin"| __truncated__ "['TV', 'Internet', 'Wifi', 'Air conditioning', 'Kitchen', 'Paid parking off premises', 'Elevator', 'Buzzer/wire"| __truncated__ "['TV', 'Cable TV', 'Internet', 'Wifi', 'Air conditioning', 'Wheelchair accessible', 'Kitchen', 'Paid parking of"| __truncated__ ...
##  $ price                : num  105 25 85 899 83 45 29 50 165 70 ...
##  $ minimum_nights       : int  32 1 1 2 1 2 1 1 3 1 ...
##  $ has_availability     : chr  "t" "t" "t" "t" ...
##  $ availability_30      : Factor w/ 31 levels "0","1","2","3",..: 10 19 22 4 3 1 11 18 20 18 ...
##  $ availability_60      : int  39 48 51 24 16 0 30 47 23 47 ...
##  $ availability_90      : int  40 78 71 47 26 0 60 77 42 59 ...
##  $ availability_365     : int  40 353 327 241 297 0 335 352 127 64 ...
##  $ number_of_reviews_ltm: int  0 12 2 7 0 0 2 4 9 21 ...
##  $ review_scores_rating : num  0 83 90 92 0 96 100 90 86 97 ...
##  $ occupancy_rate_30    : num  70 40 30 90 93.3 ...
##  - attr(*, "na.action")= 'omit' Named int [1:47] 260 411 618 1188 1209 1329 1448 1490 1584 1630 ...
##   ..- attr(*, "names")= chr [1:47] "260" "411" "618" "1188" ...
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
##            bathrooms             bedrooms         accommodates 
##                FALSE                FALSE                FALSE 
##                 beds             latitude            longitude 
##                FALSE                FALSE                FALSE 
## review_scores_rating       minimum_nights                price 
##                FALSE                FALSE                FALSE 
##    occupancy_rate_30 
##                FALSE
##            bathrooms             bedrooms         accommodates 
##                FALSE                FALSE                FALSE 
##                 beds             latitude            longitude 
##                FALSE                FALSE                FALSE 
## review_scores_rating       minimum_nights                price 
##                FALSE                FALSE                FALSE 
##    occupancy_rate_30 
##                FALSE
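
The level alignment and NA check that produce output of this kind could be sketched as follows. The toy train and test data frames are hypothetical, and the loop is one possible approach rather than the report's actual code.

```r
# Hypothetical split in which the test set is missing one factor level
train <- data.frame(
  room_type = factor(c("Entire home/apt", "Private room", "Shared room")),
  price = c(100, 40, 20)
)
test <- data.frame(
  room_type = factor(c("Private room", "Entire home/apt")),
  price = c(45, 90)
)

# Give each factor column in test the level set of the same column in train
factor_cols <- names(train)[sapply(train, is.factor)]
for (col in factor_cols) {
  test[[col]] <- factor(test[[col]], levels = levels(train[[col]]))
}

# TRUE when the level sets match; anyNA() flags numeric columns containing NA
levels_match <- identical(levels(train$room_type), levels(test$room_type))
na_check <- sapply(test[sapply(test, is.numeric)], anyNA)
```

A category that occurs only in the test set becomes NA under this approach, so the NA check should be run after the alignment.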

At this point, we also prepare normalized versions of the train and test sets, named train_normalized and test_normalized respectively, for use in some of the models.

# Numeric variables to normalize
numeric_columns <- c("bathrooms", "bedrooms", "accommodates", "beds",
                     "latitude", "longitude", "review_scores_rating", "minimum_nights", "price")

# Min-max normalization to the range [0, 1], applied to the train data
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
train_normalized <- train
train_normalized[, numeric_columns] <- lapply(train[, numeric_columns], normalize)

# Normalize test data (use train data's min and max for normalization)
test_normalized <- test
test_normalized[, numeric_columns] <- lapply(numeric_columns, function(col) {
  (test[[col]] - min(train[[col]])) / (max(train[[col]]) - min(train[[col]]))
})

str(train_normalized)
## 'data.frame':    5943 obs. of  26 variables:
##  $ X                    : int  16886 3695 11158 8191 17206 11077 13409 12470 13815 11234 ...
##  $ id                   : int  34191752 7555948 24448300 19424985 34523402 24344024 29030221 26979907 29730539 24589255 ...
##  $ host_id              : int  224372816 3911721 163379623 96917715 4006184 9784103 218785807 9784103 36607755 1988318 ...
##  $ host_is_superhost    : chr  "t" "f" "f" "f" ...
##  $ host_listings_count  : int  9 39 109 1 16 32 2 32 91 2 ...
##  $ neighbourhood        : Factor w/ 66 levels "","Camp d'en Grassot i Gràcia Nova",..: 5 2 58 27 58 8 27 58 24 5 ...
##  $ zipcode              : Factor w/ 53 levels "8000","8001",..: 2 26 15 41 5 10 33 5 13 3 ...
##  $ latitude             : num  0.252 0.473 0.205 0.571 0.183 ...
##  $ longitude            : num  0.563 0.579 0.349 0.636 0.537 ...
##  $ property_type        : Factor w/ 26 levels "Aparthotel","Apartment",..: 2 2 2 2 21 2 11 2 2 2 ...
##  $ room_type            : Factor w/ 3 levels "Entire home/apt",..: 1 1 1 2 1 1 2 1 2 2 ...
##  $ accommodates         : num  0.2667 0.3333 0.2 0.2 0.0667 ...
##  $ bathrooms            : num  0.1 0.1 0.1 0.1 0.1 0.1 0.15 0.1 0.15 0.1 ...
##  $ bedrooms             : num  0.1667 0.1667 0.1667 0.0833 0 ...
##  $ beds                 : num  0.125 0.125 0.05 0.05 0.025 0.05 0.05 0.075 0.025 0.05 ...
##  $ amenities            : chr  "['TV', 'Cable TV', 'Wifi', 'Air conditioning', 'Kitchen', 'Elevator', 'Heating', 'Washer', 'Essentials', 'Hange"| __truncated__ "['TV', 'Internet', 'Wifi', 'Air conditioning', 'Kitchen', 'Paid parking off premises', 'Elevator', 'Buzzer/wire"| __truncated__ "['TV', 'Wifi', 'Kitchen', 'Smoking allowed', 'Elevator', 'Family/kid friendly', 'Washer', 'Essentials', 'Hanger"| __truncated__ "['TV', 'Wifi', 'Wheelchair accessible', 'Kitchen', 'Pets allowed', 'Elevator', 'Buzzer/wireless intercom', 'Hea"| __truncated__ ...
##  $ price                : num  0.0987 0.0785 0.0765 0.0383 0.0433 ...
##  $ minimum_nights       : num  0.03448 0 0 0.00111 0 ...
##  $ has_availability     : chr  "t" "t" "t" "t" ...
##  $ availability_30      : Factor w/ 31 levels "0","1","2","3",..: 10 22 3 1 18 20 18 16 10 1 ...
##  $ availability_60      : int  39 51 16 0 47 23 47 45 39 0 ...
##  $ availability_90      : int  40 71 26 0 77 42 59 75 69 0 ...
##  $ availability_365     : int  40 327 297 0 352 127 64 160 344 0 ...
##  $ number_of_reviews_ltm: int  0 2 0 0 4 9 21 2 0 0 ...
##  $ review_scores_rating : num  0 0.9 0 0.96 0.9 0.86 0.97 0.9 0 0.84 ...
##  $ occupancy_rate_30    : num  70 30 93.3 100 43.3 ...
##  - attr(*, "na.action")= 'omit' Named int [1:47] 260 411 618 1188 1209 1329 1448 1490 1584 1630 ...
##   ..- attr(*, "names")= chr [1:47] "260" "411" "618" "1188" ...
str(test_normalized)
## 'data.frame':    3932 obs. of  26 variables:
##  $ X                    : int  3429 3051 18373 2346 2513 16316 3217 8479 19826 13020 ...
##  $ id                   : int  6787210 5767967 35589671 3717736 4150857 33541726 6311672 19781115 36573621 27969541 ...
##  $ host_id              : int  15681396 2151490 11951658 18976620 4614901 223454674 6099660 76306433 124756620 137229082 ...
##  $ host_is_superhost    : chr  "f" "f" "f" "f" ...
##  $ host_listings_count  : int  6 6 0 2 1 2 49 1 2 2 ...
##  $ neighbourhood        : Factor w/ 66 levels "","Camp d'en Grassot i Gràcia Nova",..: 39 22 55 55 8 12 49 8 5 22 ...
##  $ zipcode              : Factor w/ 53 levels "8000","8001",..: 21 2 27 20 10 4 28 30 2 2 ...
##  $ latitude             : num  0.629 0.305 0.516 0.565 0.408 ...
##  $ longitude            : num  0.82 0.537 0.659 0.903 0.555 ...
##  $ property_type        : Factor w/ 26 levels "Aparthotel","Apartment",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ room_type            : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 2 2 2 1 2 2 1 ...
##  $ accommodates         : num  0.2667 1 0 0 0.0667 ...
##  $ bathrooms            : num  0.1 0.6 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.15 ...
##  $ bedrooms             : num  0.1667 0.5833 0.0833 0.25 0.0833 ...
##  $ beds                 : num  0.125 0.325 0.025 0.025 0.025 0.025 0.1 0.025 0.025 0.05 ...
##  $ amenities            : chr  "['Internet', 'Wifi', 'Air conditioning', 'Elevator', 'Free street parking', 'Buzzer/wireless intercom', 'Heatin"| __truncated__ "['TV', 'Cable TV', 'Internet', 'Wifi', 'Air conditioning', 'Wheelchair accessible', 'Kitchen', 'Paid parking of"| __truncated__ "['TV', 'Wifi', 'Kitchen', 'Smoking allowed', 'Pets allowed', 'Breakfast', 'Elevator', 'Washer', 'First aid kit'"| __truncated__ "['Internet', 'Wifi', 'Air conditioning', 'Kitchen', 'Free parking on premises', 'Smoking allowed', 'Gym', 'Elev"| __truncated__ ...
##  $ price                : num  0.0181 0.8983 0.0222 0.0222 0.0785 ...
##  $ minimum_nights       : num  0 0.00111 0 0.03226 0.00222 ...
##  $ has_availability     : chr  "t" "t" "t" "t" ...
##  $ availability_30      : Factor w/ 31 levels "0","1","2","3",..: 19 4 11 10 28 6 5 1 7 4 ...
##  $ availability_60      : int  48 24 30 39 57 15 4 0 6 33 ...
##  $ availability_90      : int  78 47 60 69 87 23 7 0 6 63 ...
##  $ availability_365     : int  353 241 335 70 87 203 65 0 6 338 ...
##  $ number_of_reviews_ltm: int  12 7 2 0 0 13 6 0 0 1 ...
##  $ review_scores_rating : num  0.83 0.92 1 0.95 0 0.97 0.85 0 0 0.2 ...
##  $ occupancy_rate_30    : num  40 90 66.7 70 10 ...
##  - attr(*, "na.action")= 'omit' Named int [1:47] 260 411 618 1188 1209 1329 1448 1490 1584 1630 ...
##   ..- attr(*, "names")= chr [1:47] "260" "411" "618" "1188" ...

For some of the models, the categorical variable neighbourhood needs to be encoded into numerical values. We apply one-hot encoding, so each unique neighbourhood becomes a separate binary feature.

# One-hot encode 'neighbourhood' in train data for SVM ('- 1' drops the intercept, so every level gets its own column)
train_neigh_dummies <- model.matrix(~neighbourhood - 1, data=train_normalized)

# Combine the dummy variables with the train data
train_normalized_encod <- cbind(train_normalized, train_neigh_dummies)

# Remove the original 'neighbourhood' column (since it's now one-hot encoded)
train_normalized_encod$neighbourhood <- NULL

# One-hot encode 'neighbourhood' in test data for SVM (use the same levels as in the train data)
test_neigh_dummies <- model.matrix(~neighbourhood - 1, data=test_normalized)

# Ensure test_normalized has the same dummy variables as train_normalized

# Select the columns of the test data to match the train data's one-hot encoded columns
test_neigh_dummies <- test_neigh_dummies[, colnames(train_neigh_dummies), drop = FALSE]

# Combine the dummy variables with the test data
test_normalized_encod <- cbind(test_normalized, test_neigh_dummies)

# Remove the original 'neighbourhood' column from test data
test_normalized_encod$neighbourhood <- NULL
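
The column alignment above relies on the train and test factors sharing the same levels. As a minimal sketch on toy data (the data frames `toy_train` and `toy_test` and the helper `encode_like_train` are hypothetical and not part of the analysis), one way to make that assumption explicit is to re-level the test factor with the training levels before encoding:

```r
# Toy data: the test set lacks one neighbourhood seen in training.
toy_train <- data.frame(neighbourhood = factor(c("Raval", "Eixample", "Sants", "Eixample")))
toy_test  <- data.frame(neighbourhood = factor(c("Sants", "Raval")))

# Hypothetical helper: encode `new` using the factor levels found in `ref`.
encode_like_train <- function(new, ref) {
  # Re-level so model.matrix() creates a column for every training level
  new$neighbourhood <- factor(new$neighbourhood, levels = levels(ref$neighbourhood))
  # '- 1' drops the intercept, so no level is absorbed as the reference
  model.matrix(~ neighbourhood - 1, data = new)
}

train_dummies <- model.matrix(~ neighbourhood - 1, data = toy_train)
test_dummies  <- encode_like_train(toy_test, toy_train)

colnames(test_dummies)
```

With this approach a level that is absent from the test set still gets an all-zero column. A test value never seen in training becomes NA under the re-levelling and would need separate handling.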

Variables for Pricing Model

Next, based on the correlation matrix shown in chapter 3.4, the variables used to address the first research question, about price, are:

  • bedrooms
  • bathrooms
  • accommodates
  • beds
  • latitude and longitude
  • review_scores_rating
  • minimum_nights
  • property_type
  • room_type
  • neighbourhood

Variables for Occupancy Rate Model

The variables used for the one-month occupancy rate (occupancy_rate_30) are:

  • latitude and longitude (location)
  • bathrooms
  • bedrooms
  • accommodates
  • beds
  • price
  • minimum_nights
  • review_scores_rating
  • neighbourhood

5.3 Inspection of relationships between price and the predictors

Before fitting the models, it is good practice to get an overview of the relationships between the response and the predictors. This analysis also supports the choice of distributions and parameters in the different models (e.g. which kernel for SVM, which family distribution in GLMs, …).

5.4 Inspection of relationships between occupancy_rate_30 and the predictors

5.5 Linear Models

5.5.1 Introduction to Linear Models

In a linear model, the response is a continuous variable and the errors are assumed to follow a normal distribution. To answer question 1 (What are the key factors influencing accommodation prices in Barcelona?), the response variable in the model is the price of accommodation in Barcelona. By fitting the linear model and analysing the linear relations between the response and the predictors, we identify which predictors may affect the price and interpret their effects.

Specifications. The chosen variables, the relations among them, and their correlations are studied directly with the model. For this model, the variable availability_30 is treated as numeric.

To answer question 2 (Can we predict occupancy rates based on location, amenities, or other factors?), a baseline linear model is fitted with variables selected according to the EDA. The response variable is the calculated variable occupancy_rate_30.

5.5.2 Numeric Predictors

To begin, we fit a baseline linear model with the numeric predictors of the original dataset:

## Model Summary (Start):
## 
## Call:
## lm(formula = price ~ host_listings_count + latitude + longitude + 
##     accommodates + bathrooms + bedrooms + beds + minimum_nights + 
##     availability_30 + availability_60 + availability_90 + review_scores_rating + 
##     availability_365 + number_of_reviews_ltm, data = BCN_Accomm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -442.07  -35.91  -13.01   10.65  942.34 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            9.991e+03  2.671e+03   3.741 0.000185 ***
## host_listings_count    1.197e-01  1.723e-02   6.947 3.96e-12 ***
## latitude              -2.519e+02  6.537e+01  -3.854 0.000117 ***
## longitude              1.988e+02  5.429e+01   3.661 0.000252 ***
## accommodates           2.023e+01  9.148e-01  22.112  < 2e-16 ***
## bathrooms              1.077e+01  1.763e+00   6.111 1.03e-09 ***
## bedrooms               9.494e+00  1.718e+00   5.526 3.35e-08 ***
## 
## Model Summary (End):
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 91.81 on 9860 degrees of freedom
## Multiple R-squared:  0.2946, Adjusted R-squared:  0.2936 
## F-statistic: 294.1 on 14 and 9860 DF,  p-value: < 2.2e-16

Only four predictors might not have an effect on price; all the rest seem to affect the response variable. However, the linear regression model explains only about 29% of the variability in accommodation prices in Barcelona (adjusted R-squared: 0.2936), and the residual standard error (RSE), the typical deviation of the observed values from the values predicted by the model, is 91.81, which is relatively high for the range of prices.

Given these results, and since the response variable is positive over its whole range, applying a log transformation to the response may improve performance.

The following boxplots compare price with the transformed response, log(price).

As the boxplot comparison shows, the log transformation reduces the skewness of the price distribution. The boxplot of price shows a highly skewed distribution, whereas the log transformation compresses the range and reduces the impact of extreme values. The resulting distribution appears more symmetrical, which is desirable for linear regression and may improve model robustness and fit.
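
The effect of the log transformation on skewness can also be checked numerically. The sketch below uses simulated log-normal values as a stand-in for prices (it does not use the Airbnb data) and a hand-written `skewness` helper, both of which are assumptions for illustration:

```r
set.seed(1)
# Simulated right-skewed "prices" (log-normal); not the Airbnb data
sim_price <- rlnorm(5000, meanlog = 4.5, sdlog = 0.7)

# Sample skewness: mean of the cubed standardized values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(sim_price)       # clearly positive: long right tail
skew_log <- skewness(log(sim_price))  # close to zero: roughly symmetric

c(raw = skew_raw, log = skew_log)
```

A skewness near zero after the transformation is consistent with the more symmetrical boxplot described above.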

The distribution of the accommodates predictor against log(price) is viewed graphically.

We then fit the model with the transformed response log(price) and check its performance. This is an intermediate model; further details are in the R Markdown file.

## 
## Call:
## lm(formula = log(price) ~ host_listings_count + latitude + longitude + 
##     accommodates + bathrooms + bedrooms + beds + minimum_nights + 
##     availability_30 + availability_60 + availability_90 + review_scores_rating + 
##     availability_365 + number_of_reviews_ltm, data = BCN_Accomm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4461 -0.3541 -0.0410  0.3110  4.1241 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.815e+02  1.688e+01  10.750  < 2e-16 ***
## host_listings_count    1.041e-03  1.089e-04   9.559  < 2e-16 ***
## latitude              -4.404e+00  4.132e-01 -10.657  < 2e-16 ***
## longitude              1.945e+00  3.432e-01   5.666 1.50e-08 ***
## accommodates           2.404e-01  5.784e-03  41.572  < 2e-16 ***
## bathrooms             -3.507e-02  1.115e-02  -3.147  0.00166 ** 
## bedrooms               4.848e-02  1.086e-02   4.464 8.15e-06 ***
## beds                  -5.438e-02  6.406e-03  -8.489  < 2e-16 ***
## minimum_nights        -4.853e-03  3.399e-04 -14.276  < 2e-16 ***
## availability_30        7.701e-03  1.673e-03   4.604 4.20e-06 ***
## availability_60       -3.657e-05  1.452e-03  -0.025  0.97990    
## availability_90        2.905e-04  7.221e-04   0.402  0.68753    
## review_scores_rating  -4.173e-04  1.645e-04  -2.537  0.01119 *  
## availability_365       3.033e-04  5.703e-05   5.318 1.07e-07 ***
## number_of_reviews_ltm -6.436e-04  3.658e-04  -1.759  0.07853 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5804 on 9860 degrees of freedom
## Multiple R-squared:  0.4437, Adjusted R-squared:  0.443 
## F-statistic: 561.8 on 14 and 9860 DF,  p-value: < 2.2e-16

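The log transformation of price improves the fit: model lm.BCN_Accomm.total_num.transf has an RSE of 0.58 (on the log scale) and an R-squared of 0.44, up from 0.29 for the untransformed model. In the output above, availability_60 and availability_90 show no significant effect on the response variable (p = 0.98 and p = 0.69). These two variables, together with bedrooms, are candidates for removal; the next section checks them for collinearity.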

5.5.3 Collinearity

We inspect the correlation of bedrooms with other similar variables:

##              bathrooms accommodates      beds  bedrooms
## bathrooms    1.0000000    0.4626653 0.4925975 0.4920523
## accommodates 0.4626653    1.0000000 0.8573605 0.8129453
## beds         0.4925975    0.8573605 1.0000000 0.8018828
## bedrooms     0.4920523    0.8129453 0.8018828 1.0000000

There are strong correlations among accommodates, beds, and bedrooms, which may indicate redundancy in the dataset.

##   host_listings_count              latitude             longitude 
##              1.074862              1.119766              1.116830 
##          accommodates             bathrooms              bedrooms 
##              4.765398              1.384057              3.502685 
##                  beds        minimum_nights       availability_30 
##              4.416405              1.120387              6.638292 
##       availability_60       availability_90  review_scores_rating 
##             22.377368             12.832252              1.251751 
##      availability_365 number_of_reviews_ltm 
##              1.522703              1.254544

Predictors with a VIF greater than 5 suggest multicollinearity with other predictors. Here this applies to availability_30 (6.6), availability_60 (22.4), and availability_90 (12.8), which indicates that the availability_* variables are correlated with each other. availability_60 and availability_90 will be removed from the model.
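
For reference, the VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The following is a minimal base-R sketch on simulated, deliberately nested availability counts (the variable names and values are hypothetical; the values reported above were computed on the real data, and their exact figures are not reproduced here):

```r
set.seed(2)
n <- 500
# Simulated nested availability counts (not the real data)
avail_30 <- rbinom(n, 30, 0.5)
avail_60 <- avail_30 + rbinom(n, 30, 0.5)
avail_90 <- avail_60 + rbinom(n, 30, 0.5)
other    <- rnorm(n)  # unrelated predictor
X <- data.frame(avail_30, avail_60, avail_90, other)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

In this toy setup the nested counts inflate each other's VIF, while the unrelated predictor stays close to 1.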

5.5.4 Refitting the model with insights.

Those predictors are removed from the model (because they have no effect on the response variable or because of multicollinearity). The full output of the resulting model lm.BCN_Accomm.total_num.0 is not shown, as it is an intermediate model; more details are in the R script. The last part of the summary shows:

## [1] "Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
## [2] ""                                                              
## [3] "Residual standard error: 0.5809 on 9863 degrees of freedom"    
## [4] "Multiple R-squared:  0.4426,\tAdjusted R-squared:  0.442 "     
## [5] "F-statistic:   712 on 11 and 9863 DF,  p-value: < 2.2e-16"     
## [6] ""

Collinearity check (output not shown; see the R script for details).

In the model lm.BCN_Accomm.total_num.0, the VIF values are all within acceptable ranges regarding multicollinearity. The R-squared of 0.4426 indicates that approximately 44.3% of the variability in the log-transformed accommodation price is explained by the predictors in the model. The adjusted R-squared of 0.442 accounts for the number of predictors and provides a slightly more conservative estimate. All the predictors seem to have an effect on the response variable, at different levels. The interpretation of coefficients will be done with the complete linear model. RSE = 0.58. This linear model lm.BCN_Accomm.total_num.0 will later be merged with the categorical variables.

5.5.5 Categorical Predictors

Some brief reasons why categorical variables are included in or excluded from the linear model (besides the EDA reasons):

X, id, and host_id are identifying variables, so they are not considered relevant predictors. Zipcode gives information about location, but it was excluded because latitude and longitude provide more precise location information, making zipcode redundant. However, neighbourhood will be included initially (for EDA reasons).

The original layout of amenities in the dataset is too complex to be treated directly as a factor.

Let's focus on the remaining factors:

has_availability has only one level, so it is not taken into consideration for modelling.

We test whether the different levels of neighbourhood, property type, and room type, and whether the host is a superhost, have different influences on the log(price) of the accommodation. Boxplots are used to visualize the effect of these categorical variables.

We fit a starting model with only categorical variables. To test categorical variables with more than two levels, the drop1() function must be used. Furthermore, the results obtained with drop1() are unaffected by the ordering of the predictors. All the factors seem to have an effect on the log-transformed price.
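
As an illustration of how drop1() tests a whole factor at once, here is a small sketch on simulated data (the data frame `sim`, the factor `noise_fac`, and the effect sizes are assumptions made for this example, not values from the analysis):

```r
set.seed(3)
n <- 300
# Simulated listings (hypothetical values, not the Airbnb data)
room_type <- factor(sample(c("Entire home/apt", "Private room", "Shared room"),
                           n, replace = TRUE))
noise_fac <- factor(sample(c("a", "b", "c"), n, replace = TRUE))  # unrelated to price
shift     <- c("Entire home/apt" = 0, "Private room" = -0.6, "Shared room" = -1.1)
log_price <- 4.5 + unname(shift[as.character(room_type)]) + rnorm(n, sd = 0.5)
sim <- data.frame(log_price, room_type, noise_fac)

fit <- lm(log_price ~ room_type + noise_fac, data = sim)

# One F-test per term: all dummy columns of a factor are dropped together,
# and each term is tested given all the others, so term order does not matter
tab <- drop1(fit, test = "F")
tab
```

Each factor appears as a single row with its own degrees of freedom (number of levels minus one), which is what makes the test suitable for factors with more than two levels.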

5.5.6 Numeric and categorical variables

Let's add these four factors to the previously chosen linear model with only continuous variables (section 5.5.4) and analyse the results:

## Single term deletions
## 
## Model:
## log(price) ~ host_listings_count + latitude + longitude + accommodates + 
##     bathrooms + beds + minimum_nights + availability_30 + review_scores_rating + 
##     availability_365 + number_of_reviews_ltm + host_is_superhost + 
##     neighbourhood + property_type + room_type
##                       Df Sum of Sq    RSS    AIC  F value    Pr(>F)    
## <none>                             2579.2 -13048                       
## host_listings_count    1      3.62 2582.8 -13036  13.6981 0.0002159 ***
## latitude               1      5.97 2585.1 -13027  22.5957 2.028e-06 ***
## longitude              1      5.40 2584.6 -13029  20.4387 6.230e-06 ***
## accommodates           1    176.54 2755.7 -12396 668.7323 < 2.2e-16 ***
## bathrooms              1      1.75 2580.9 -13043   6.6387 0.0099932 ** 
## beds                   1      2.82 2582.0 -13039  10.6692 0.0010931 ** 
## minimum_nights         1    139.38 2718.6 -12530 527.9837 < 2.2e-16 ***
## availability_30        1     76.20 2655.4 -12762 288.6452 < 2.2e-16 ***
## review_scores_rating   1      0.97 2580.1 -13046   3.6653 0.0555858 .  
## availability_365       1      0.17 2579.3 -13049   0.6579 0.4173379    
## number_of_reviews_ltm  1      3.54 2582.7 -13036  13.4111 0.0002515 ***
## host_is_superhost      1     18.19 2597.3 -12980  68.8865 < 2.2e-16 ***
## neighbourhood         65    107.79 2687.0 -12773   6.2818 < 2.2e-16 ***
## property_type         25    168.71 2747.9 -12472  25.5628 < 2.2e-16 ***
## room_type              2    378.84 2958.0 -11698 717.5364 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
## [2] ""                                                              
## [3] "Residual standard error: 0.5138 on 9770 degrees of freedom"    
## [4] "Multiple R-squared:  0.5681,\tAdjusted R-squared:  0.5635 "    
## [5] "F-statistic: 123.6 on 104 and 9770 DF,  p-value: < 2.2e-16"    
## [6] ""

The goodness of fit improves compared with the numeric-only model lm.BCN_Accomm.total_num.0: the adjusted R-squared is 0.5635 and the RSE is 0.5138. availability_365 seems to have no effect on the response variable (p = 0.42), so it will be removed. The rest of the predictors and factors seem to have an impact on the response variable. Let's check the multicollinearity:

##                            GVIF Df GVIF^(1/(2*Df))
## host_listings_count    1.148228  1        1.071554
## latitude               8.500870  1        2.915625
## longitude              5.932511  1        2.435675
## accommodates           5.397587  1        2.323271
## bathrooms              1.505008  1        1.226788
## beds                   4.155569  1        2.038521
## minimum_nights         1.219139  1        1.104146
## availability_30        1.237647  1        1.112496
## review_scores_rating   1.308473  1        1.143885
## availability_365       1.295139  1        1.138042
## number_of_reviews_ltm  1.361027  1        1.166631
## host_is_superhost      1.197995  1        1.094530
## neighbourhood         83.334064 65        1.034607
## property_type          2.776956 25        1.020637
## room_type              3.085401  2        1.325342

Based on the output, neighbourhood, with a GVIF of 83.3, should be removed due to its collinearity, probably with latitude and longitude, which both have a GVIF above 5 (8.5 and 5.9). There might also be some collinearity between beds and accommodates; since beds seems to have less impact on the response variable than accommodates, beds will be removed and the collinearity checked afterwards.
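
The last column of the table rescales GVIF so that terms with different degrees of freedom can be compared. As a small worked check, using only values copied from the table above, the rescaling GVIF^(1/(2*Df)) can be reproduced as follows:

```r
# GVIF and Df for three terms, copied from the table above
gvif <- c(neighbourhood = 83.334064, latitude = 8.500870, longitude = 5.932511)
df   <- c(neighbourhood = 65,        latitude = 1,        longitude = 1)

# Df-adjusted value shown in the last column: GVIF^(1 / (2 * Df))
gvif_adj <- gvif^(1 / (2 * df))
round(gvif_adj, 4)
```

For single-coefficient terms the adjusted value is simply the square root of the GVIF. For a 65-df factor such as neighbourhood the adjustment is much stronger, so the raw GVIF and the adjusted column can rank the terms differently.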

Refitting the linear model with these findings gives lm_BCN_Accomm.1:

## Single term deletions
## 
## Model:
## log(price) ~ host_listings_count + latitude + longitude + accommodates + 
##     bathrooms + minimum_nights + availability_30 + review_scores_rating + 
##     number_of_reviews_ltm + host_is_superhost + property_type + 
##     room_type
##                       Df Sum of Sq    RSS    AIC   F value    Pr(>F)    
## <none>                             2692.8 -12756                        
## host_listings_count    1      5.60 2698.4 -12738   20.4497 6.194e-06 ***
## latitude               1     24.27 2717.0 -12669   88.6455 < 2.2e-16 ***
## longitude              1     10.14 2702.9 -12721   37.0266 1.209e-09 ***
## accommodates           1    328.76 3021.5 -11620 1200.9848 < 2.2e-16 ***
## bathrooms              1      4.39 2697.2 -12742   16.0295 6.282e-05 ***
## minimum_nights         1    157.29 2850.1 -12197  574.6169 < 2.2e-16 ***
## availability_30        1     90.09 2782.8 -12433  329.1256 < 2.2e-16 ***
## review_scores_rating   1      0.68 2693.4 -12756    2.4932 0.1143742    
## number_of_reviews_ltm  1      3.09 2695.8 -12747   11.2752 0.0007885 ***
## host_is_superhost      1     21.53 2714.3 -12679   78.6482 < 2.2e-16 ***
## property_type         25    190.91 2883.7 -12130   27.8970 < 2.2e-16 ***
## room_type              2    439.35 3132.1 -11267  802.5059 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
## [2] ""                                                              
## [3] "Residual standard error: 0.5232 on 9837 degrees of freedom"    
## [4] "Multiple R-squared:  0.549,\tAdjusted R-squared:  0.5473 "     
## [5] "F-statistic: 323.7 on 37 and 9837 DF,  p-value: < 2.2e-16"     
## [6] ""

Check the collinearity:

##                           GVIF Df GVIF^(1/(2*Df))
## host_listings_count   1.129485  1        1.062772
## latitude              1.143820  1        1.069495
## longitude             1.149898  1        1.072333
## accommodates          2.393156  1        1.546983
## bathrooms             1.427311  1        1.194701
## minimum_nights        1.179743  1        1.086160
## availability_30       1.062009  1        1.030538
## review_scores_rating  1.290039  1        1.135799
## number_of_reviews_ltm 1.341496  1        1.158230
## host_is_superhost     1.187074  1        1.089529
## property_type         1.650918 25        1.010077
## room_type             2.850998  2        1.299419

Removing neighbourhood brings the GVIF of latitude and longitude down to values around 1. The GVIF of accommodates also drops to an adequate level. The adjusted R-squared and RSE of the previous model were only slightly better: in this linear model the adjusted R-squared is 0.55 and the RSE is 0.52. The difference is very small, and avoiding the collinearity gives the model more stability. The interpretation of coefficients is explained further on with the final linear model.

5.5.7 Study some interactions

Some interactions among variables are first inspected visually.

Different combinations of interactions are then fitted, for example:

## Single term deletions
## 
## Model:
## log(price) ~ host_listings_count + latitude + longitude + accommodates + 
##     bathrooms + minimum_nights + availability_30 + review_scores_rating + 
##     number_of_reviews_ltm + host_is_superhost + property_type + 
##     room_type + availability_30:room_type + accommodates:host_is_superhost
##                                Df Sum of Sq    RSS    AIC  F value    Pr(>F)    
## <none>                                      2681.5 -12791                       
## host_listings_count             1     6.517 2688.0 -12769  23.9017 1.030e-06 ***
## latitude                        1    24.626 2706.1 -12703  90.3137 < 2.2e-16 ***
## longitude                       1    10.188 2691.7 -12756  37.3645 1.017e-09 ***
## bathrooms                       1     4.537 2686.0 -12777  16.6386 4.558e-05 ***
## minimum_nights                  1   156.959 2838.4 -12232 575.6273 < 2.2e-16 ***
## review_scores_rating            1     0.479 2682.0 -12792   1.7567 0.1850683    
## number_of_reviews_ltm           1     2.968 2684.5 -12782  10.8853 0.0009727 ***
## property_type                  25   193.117 2874.6 -12155  28.3292 < 2.2e-16 ***
## availability_30:room_type       2     6.081 2687.6 -12773  11.1506 1.455e-05 ***
## accommodates:host_is_superhost  1     5.179 2686.7 -12774  18.9944 1.324e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
## [2] ""                                                              
## [3] "Residual standard error: 0.5222 on 9834 degrees of freedom"    
## [4] "Multiple R-squared:  0.5509,\tAdjusted R-squared:  0.5491 "    
## [5] "F-statistic: 301.6 on 40 and 9834 DF,  p-value: < 2.2e-16"     
## [6] ""

Compared with model lm_BCN_Accomm.1, the difference in adjusted R-squared is minimal (0.5491 versus 0.5473, a gain of 0.0018), suggesting that the interaction terms add very little explanatory power to the model. The RSE is slightly lower in the model with interactions (0.5222 versus 0.5232), which indicates a marginal improvement. The interaction terms are statistically significant, but their contribution to the overall model fit is small, so the improvement might not justify the additional complexity.

None of the other combinations tried improved the results significantly. Therefore, for the reasons shown during the fitting, lm_BCN_Accomm.1 is retained as the proposed linear model.
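
The trade-off between statistical significance and practical gain can be examined with a partial F-test alongside the change in adjusted R-squared. The sketch below uses simulated data with a deliberately weak interaction (all names and effect sizes are assumptions for illustration, not the models fitted above):

```r
set.seed(4)
n <- 400
# Simulated data with a weak interaction (hypothetical, not the Airbnb data)
accommodates <- sample(1:8, n, replace = TRUE)
superhost    <- factor(sample(c("f", "t"), n, replace = TRUE))
log_price    <- 4 + 0.15 * accommodates + 0.1 * (superhost == "t") +
  0.02 * accommodates * (superhost == "t") + rnorm(n, sd = 0.5)
sim <- data.frame(log_price, accommodates, superhost)

fit_main <- lm(log_price ~ accommodates + superhost, data = sim)
fit_int  <- lm(log_price ~ accommodates * superhost, data = sim)

# Partial F-test for the extra interaction term
cmp <- anova(fit_main, fit_int)

# Gain in adjusted R-squared from adding the interaction
adj_gain <- summary(fit_int)$adj.r.squared - summary(fit_main)$adj.r.squared

cmp
adj_gain
```

Reading the F-test p-value next to the adjusted R-squared gain makes it easier to judge whether an interaction is worth the added complexity.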

5.5.8 Interpretation Final Linear model lm_BCN_Accomm.1

## Single term deletions
## 
## Model:
## log(price) ~ host_listings_count + latitude + longitude + accommodates + 
##     bathrooms + minimum_nights + availability_30 + review_scores_rating + 
##     number_of_reviews_ltm + host_is_superhost + property_type + 
##     room_type
##                       Df Sum of Sq    RSS    AIC   F value    Pr(>F)    
## <none>                             2692.8 -12756                        
## host_listings_count    1      5.60 2698.4 -12738   20.4497 6.194e-06 ***
## latitude               1     24.27 2717.0 -12669   88.6455 < 2.2e-16 ***
## longitude              1     10.14 2702.9 -12721   37.0266 1.209e-09 ***
## accommodates           1    328.76 3021.5 -11620 1200.9848 < 2.2e-16 ***
## bathrooms              1      4.39 2697.2 -12742   16.0295 6.282e-05 ***
## minimum_nights         1    157.29 2850.1 -12197  574.6169 < 2.2e-16 ***
## availability_30        1     90.09 2782.8 -12433  329.1256 < 2.2e-16 ***
## review_scores_rating   1      0.68 2693.4 -12756    2.4932 0.1143742    
## number_of_reviews_ltm  1      3.09 2695.8 -12747   11.2752 0.0007885 ***
## host_is_superhost      1     21.53 2714.3 -12679   78.6482 < 2.2e-16 ***
## property_type         25    190.91 2883.7 -12130   27.8970 < 2.2e-16 ***
## room_type              2    439.35 3132.1 -11267  802.5059 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
## [2] ""                                                              
## [3] "Residual standard error: 0.5232 on 9837 degrees of freedom"    
## [4] "Multiple R-squared:  0.549,\tAdjusted R-squared:  0.5473 "     
## [5] "F-statistic: 323.7 on 37 and 9837 DF,  p-value: < 2.2e-16"     
## [6] ""

Interpretation of the predictors with the most effect on log(price), ordered by the F-statistics in the table above:

  • Accommodates (F = 1200.98, p < 2.2e-16): The number of people a property can host is the strongest predictor of price. Properties accommodating more guests tend to command higher prices.
  • Room Type (F = 802.51, p < 2.2e-16): The type of room (entire home, private room, or shared room) is a crucial determinant of pricing.
  • Minimum Nights (F = 574.62, p < 2.2e-16): The required minimum number of nights for a booking strongly influences pricing.
  • Availability_30 (F = 329.13, p < 2.2e-16): Price varies significantly with availability over the next 30 days, likely due to dynamic pricing strategies by hosts to optimize short-term demand.
  • Latitude (F = 88.65, p < 2.2e-16): Geographic location, as indicated by latitude, has an impact on price.
  • Host_is_superhost (F = 78.65, p < 2.2e-16): Superhosts might charge higher prices, reflecting their elevated trust and reputation among guests. This highlights the role of host quality in guest decisions.
  • Longitude (F = 37.03, p = 1.2e-09): Longitude complements latitude in highlighting the importance of geographic location in pricing. Together, these variables emphasize the significance of spatial factors.
  • Property_Type (F = 27.90, p < 2.2e-16): The type of property (apartment, house) significantly influences prices.
  • Host_Listings_Count (F = 20.45, p = 6.2e-06): The number of listings a host manages has a mild effect on pricing, possibly because professional or large-scale hosts optimize for higher revenues.
  • Bathrooms (F = 16.03, p = 6.3e-05): The number of bathrooms moderately affects pricing.
  • Number_of_Reviews_LTM (F = 11.28, p = 0.00079): Recent guest reviews influence pricing slightly.
  • Review_Scores_Rating (F = 2.49, p = 0.114): Guest satisfaction ratings show no significant effect on price at the 5% level in this model.

Coefficients. Because the dependent variable is log-transformed, the transformation exp(coefficient) - 1 is necessary to interpret a coefficient as a percentage change in price.

  • Accommodates (Estimate = 0.142, p < 2e-16): For each additional guest the property can accommodate, the price increases by approximately 15.3% (exp(0.142) - 1).
  • Room Type: Private Room (Estimate = -0.560, p < 2e-16): Private rooms are priced approximately 42.9% lower (exp(-0.560) - 1) than entire homes (reference category). Shared Room (Estimate = -1.101, p < 2e-16): Shared rooms are priced approximately 66.7% lower (exp(-1.101) - 1) than entire homes.
  • Latitude and Longitude (Estimate Latitude = -3.919, Estimate Longitude = 2.395): These coefficients are per full degree, which is far larger than the extent of the city, so the implied changes are extreme: exp(-3.919) - 1 is about -98.0% and exp(2.395) - 1 is about +997%. Per 0.01 degree, price decreases by about 3.8% with latitude (exp(-0.03919) - 1) and increases by about 2.4% with longitude (exp(0.02395) - 1).
  • Availability in 30 Days (Estimate = 0.011, p < 2e-16): For each unit increase in availability_30, the price increases by approximately 1.1% (exp(0.011) - 1).
  • Minimum Nights (Estimate = -0.0076, p < 2e-16): Properties with longer minimum stays have prices approximately 0.8% lower (exp(-0.0076) - 1) for each additional night.
  • Bathrooms (Estimate = 0.042, p = 6.3e-05): An additional bathroom increases price by about 4.3% (exp(0.042) - 1).
  • Host is Superhost (Estimate = 0.108, p < 2e-16): Superhost properties are priced approximately 11.4% higher (exp(0.108) - 1), reflecting a premium for trusted hosts.
  • Property Type: Hotels (Estimate = 1.378, p < 2e-16): Hotels command prices approximately 296.7% higher (exp(1.378) - 1) than the reference category. Boutique Hotels (Estimate = 1.138, p < 2e-16): Boutique hotels show a premium of about 212.1% (exp(1.138) - 1).
  • Review Scores Rating (Estimate = -0.00044, p = 0.114): A slight negative relationship, where a one-unit increase in rating decreases price by about 0.04% (exp(-0.00044) - 1). The term is not significant at the 5% level in the table above, so this estimate should not be over-interpreted.
  • Number of Reviews in the Last Twelve Months (Estimate = -0.0012, p = 0.00079): For every additional review, price decreases by approximately 0.12% (exp(-0.0012) - 1), possibly reflecting aggressive pricing to maintain high occupancy.
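
The conversion from a log-scale coefficient to a percentage change can be written as a one-line computation. The sketch below applies it to a few of the estimates quoted in the text (the vector name `est` and its labels are chosen here for illustration):

```r
# Estimates as quoted in the text above (log(price) scale)
est <- c(accommodates = 0.142, private_room = -0.560, shared_room = -1.101,
         bathrooms = 0.042, superhost = 0.108)

# Percentage change in price for a one-unit increase
# (or relative to the reference level, for factor levels)
pct_change <- 100 * (exp(est) - 1)
round(pct_change, 1)
```

For small coefficients the result is close to 100 times the coefficient itself, while for large ones (such as the room type levels) the two differ noticeably, which is why the exponential form is used.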

5.5.9 Evaluation Linear model lm_BCN_Accomm.1

R² (R-squared): Value: 0.549 (from the summary). Meaning: The model explains approximately 54.9% of the variance in the log(price) variable.

Adjusted R²: Value: 0.5473. Meaning: Adjusted R² accounts for the number of predictors in the model, providing a more realistic measure when comparing models with different numbers of predictors.

Residual Standard Error (RSE): Value: 0.5232. Meaning: The residuals typically deviate by about 0.5232 units from the predicted values of log(price). This is a measure of the model’s overall error.

Mean Absolute Error (MAE):

## [1] 0.3890596
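
Because the response is log(price), this MAE is on the log scale: an average absolute error of 0.389 in log(price) corresponds to predictions that are typically off by a factor of about exp(0.389) ≈ 1.48. A minimal sketch of how such metrics can be computed, using simulated observed and fitted values (hypothetical numbers, not the model's actual residuals):

```r
set.seed(5)
# Simulated observed and fitted log-prices (hypothetical values)
obs_log <- rnorm(200, mean = 4.5, sd = 0.7)
fit_log <- obs_log + rnorm(200, sd = 0.5)

res  <- obs_log - fit_log
mae  <- mean(abs(res))     # mean absolute error on the log scale
rmse <- sqrt(mean(res^2))  # root mean squared error on the log scale

# Back-transformed reading: typical multiplicative error in price
typical_ratio <- exp(mae)

c(MAE = mae, RMSE = rmse, ratio = typical_ratio)
```

The RMSE is never smaller than the MAE, and a large gap between the two points to a few large residuals.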

5.5.10 Comparison of Observed, Fitted, and Predicted Values for the lm_BCN_Accomm.1 Model

The following plots illustrate the behaviour of the linear model lm_BCN_Accomm.1 when fitted to a simulated dataset representing accommodation listings. The first plot compares observed and fitted values, highlighting the model’s ability to estimate prices based on key predictors.

The second plot visualizes residuals to evaluate the model’s accuracy and identify potential discrepancies:

# 1. Simulate a Random Dataset (if you don't have an actual dataset)
set.seed(42)  # Ensure reproducibility
dataset <- data.frame(
  host_listings_count = rnorm(100, mean = 3, sd = 1),
  latitude = rnorm(100, mean = 41.38, sd = 0.01),
  longitude = rnorm(100, mean = 2.17, sd = 0.01),
  accommodates = sample(1:10, 100, replace = TRUE),
  bathrooms = rnorm(100, mean = 1.5, sd = 0.5),
  minimum_nights = sample(1:30, 100, replace = TRUE),
  availability_30 = rnorm(100, mean = 15, sd = 5),
  review_scores_rating = rnorm(100, mean = 90, sd = 5),
  number_of_reviews_ltm = rnorm(100, mean = 10, sd = 3),
  host_is_superhost = factor(sample(c("t", "f"), 100, replace = TRUE)),
  property_type = factor(sample(c("Apartment", "House", "Studio"), 100, replace = TRUE)),
  room_type = factor(sample(c("Entire home/apt", "Private room", "Shared room"), 100, replace = TRUE)),
  price = rnorm(100, mean = 100, sd = 20)
)

# Transform price to log scale
dataset$log_price <- log(dataset$price)

# 2. Fit the Model
lm_BCN_Accomm.1 <- lm(
  log_price ~ host_listings_count + latitude + longitude + accommodates + 
    bathrooms + minimum_nights + availability_30 + review_scores_rating + 
    number_of_reviews_ltm + host_is_superhost + property_type + 
    room_type,
  data = dataset
)

# Check the model summary
summary(lm_BCN_Accomm.1)
## 
## Call:
## lm(formula = log_price ~ host_listings_count + latitude + longitude + 
##     accommodates + bathrooms + minimum_nights + availability_30 + 
##     review_scores_rating + number_of_reviews_ltm + host_is_superhost + 
##     property_type + room_type, data = dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65457 -0.11384  0.01626  0.13082  0.48462 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)            2.743e+01  1.033e+02   0.266    0.791
## host_listings_count    4.772e-03  2.213e-02   0.216    0.830
## latitude              -5.495e-01  2.497e+00  -0.220    0.826
## longitude              1.487e-01  2.282e+00   0.065    0.948
## accommodates          -8.294e-03  7.794e-03  -1.064    0.290
## bathrooms              1.280e-02  4.233e-02   0.302    0.763
## minimum_nights         2.224e-04  2.460e-03   0.090    0.928
## availability_30       -6.010e-03  4.457e-03  -1.348    0.181
## review_scores_rating  -3.304e-03  4.895e-03  -0.675    0.502
## number_of_reviews_ltm -6.119e-03  6.056e-03  -1.010    0.315
## host_is_superhostt    -6.204e-03  4.472e-02  -0.139    0.890
## property_typeHouse     4.727e-02  5.964e-02   0.793    0.430
## property_typeStudio   -1.321e-02  5.917e-02  -0.223    0.824
## room_typePrivate room  4.410e-02  6.296e-02   0.701    0.486
## room_typeShared room  -1.430e-02  5.470e-02  -0.262    0.794
## 
## Residual standard error: 0.2105 on 85 degrees of freedom
## Multiple R-squared:  0.09636,    Adjusted R-squared:  -0.05248 
## F-statistic: 0.6474 on 14 and 85 DF,  p-value: 0.8176
# 3. Fitted Values and Predictions
# Generate fitted values
fitted_values_lm <- fitted(lm_BCN_Accomm.1)

# Create new data for predictions
new_data <- data.frame(
  host_listings_count = mean(dataset$host_listings_count, na.rm = TRUE),
  latitude = mean(dataset$latitude, na.rm = TRUE),
  longitude = mean(dataset$longitude, na.rm = TRUE),
  accommodates = seq(min(dataset$accommodates, na.rm = TRUE), max(dataset$accommodates, na.rm = TRUE), length.out = 100),
  bathrooms = mean(dataset$bathrooms, na.rm = TRUE),
  minimum_nights = mean(dataset$minimum_nights, na.rm = TRUE),
  availability_30 = mean(dataset$availability_30, na.rm = TRUE),
  review_scores_rating = mean(dataset$review_scores_rating, na.rm = TRUE),
  number_of_reviews_ltm = mean(dataset$number_of_reviews_ltm, na.rm = TRUE),
  host_is_superhost = factor("t", levels = levels(dataset$host_is_superhost)),
  property_type = factor("Apartment", levels = levels(dataset$property_type)),
  room_type = factor("Entire home/apt", levels = levels(dataset$room_type))
)

# Predict values for the new data
predicted_values <- predict(lm_BCN_Accomm.1, newdata = new_data)

# Check lengths
if (length(new_data$accommodates) == length(predicted_values)) {
  print("Lengths match! Ready to plot.")
} else {
  stop("Lengths do not match! Check new_data or model predictors.")
}
## [1] "Lengths match! Ready to plot."
# Extract residuals
residuals_lm <- resid(lm_BCN_Accomm.1)

# Select 5 random indices for residual visualization
set.seed(20)
selected_ids <- sample(x = 1:nrow(dataset), size = 5)

# 4. Visualization: Two Plots in One Page
par(mfrow = c(1, 2))  # Divide plotting area into 1 row, 2 columns

# First Plot: Observed vs. Fitted
plot(
  log(price) ~ accommodates,
  data = dataset,
  main = "Model 'lm_BCN_Accomm.1': Observed vs Fitted",
  col = "darkgray",
  pch = 16,
  xlab = "Accommodates",
  ylab = "Log(Price)"
)

# Add fitted values as points
points(
  dataset$accommodates, fitted_values_lm,
  col = "purple",
  pch = 19
)

# Add the regression line for accommodates
lines(new_data$accommodates, predicted_values, col = "blue", lwd = 2)

# Second Plot: Residuals Visualization
plot(
  log(price) ~ accommodates, 
  data = dataset, 
  main = "Residual Visualization (lm_BCN_Accomm.1)", 
  col = "lightgray",
  xlab = "Accommodates",
  ylab = "Log(Price)"
)

# Overlay the predicted (fitted) values for all data points
points(
  dataset$accommodates, fitted_values_lm, 
  col = "purple", 
  pch = 19
)

# Add the actual observed points for the selected indices
points(
  log(price) ~ accommodates, 
  data = dataset[selected_ids, ], 
  col = "red", 
  pch = 19
)

# Add the residual segments for the selected points
segments(
  x0 = dataset[selected_ids, "accommodates"],
  x1 = dataset[selected_ids, "accommodates"],
  y0 = fitted_values_lm[selected_ids],
  y1 = log(dataset[selected_ids, "price"]),
  col = "blue"
)

# Add a smoothing line to visualize the trend of fitted values
lines(
  lowess(dataset$accommodates, fitted_values_lm), 
  col = "black", 
  lwd = 2
)

# Reset plotting area to default (1 plot per page)
par(mfrow = c(1, 1))

5.5.11 Can we predict occupancy rates based on factors?

After fitting a baseline linear model with occupancy_rate_30 as the response variable and the predictors selected during EDA, the predictors that appear to have no effect on the response are removed. The resulting model, lm_occupancy.2, is then analysed.

# Fit the linear model
lm_occupancy <- lm(occupancy_rate_30 ~ latitude + longitude + bathrooms + bedrooms + 
                     accommodates + beds + price + minimum_nights + 
                     review_scores_rating + neighbourhood, 
                   data = BCN_Accomm)

# Single-term deletions: F-test for dropping each predictor in turn
drop1(lm_occupancy, test="F" )
## Single term deletions
## 
## Model:
## occupancy_rate_30 ~ latitude + longitude + bathrooms + bedrooms + 
##     accommodates + beds + price + minimum_nights + review_scores_rating + 
##     neighbourhood
##                      Df Sum of Sq     RSS   AIC  F value    Pr(>F)    
## <none>                            8354200 66713                       
## latitude              1       915 8355115 66712   1.0732 0.3002403    
## longitude             1       763 8354963 66711   0.8950 0.3441545    
## bathrooms             1     38613 8392814 66756  45.2959 1.789e-11 ***
## bedrooms              1     12027 8366228 66725  14.1090 0.0001735 ***
## accommodates          1     25788 8379988 66741  30.2506 3.892e-08 ***
## beds                  1      8047 8362247 66720   9.4396 0.0021293 ** 
## price                 1    196542 8550743 66940 230.5566 < 2.2e-16 ***
## minimum_nights        1     16408 8370609 66730  19.2480 1.160e-05 ***
## review_scores_rating  1     99273 8453473 66827 116.4533 < 2.2e-16 ***
## neighbourhood        65     91402 8445602 66690   1.6495 0.0008074 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tail(capture.output(summary(lm_occupancy)))
## [1] "Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
## [2] ""                                                              
## [3] "Residual standard error: 29.2 on 9800 degrees of freedom"      
## [4] "Multiple R-squared:  0.05924,\tAdjusted R-squared:  0.05214 "  
## [5] "F-statistic: 8.339 on 74 and 9800 DF,  p-value: < 2.2e-16"     
## [6] ""

bathrooms, bedrooms, accommodates, beds, price, minimum_nights and review_scores_rating appear to have an effect on the response variable. Only one level of neighbourhood shows a (very weak) effect on occupancy_rate_30, although the factor is significant as a whole in the F-test above. latitude and longitude show no significant effect on the response variable (p = 0.30 and p = 0.34), so we remove these two variables from the model.

## Single term deletions
## 
## Model:
## occupancy_rate_30 ~ bathrooms + bedrooms + accommodates + beds + 
##     price + minimum_nights + review_scores_rating + neighbourhood
##                      Df Sum of Sq     RSS   AIC  F value    Pr(>F)    
## <none>                            8355512 66710                       
## bathrooms             1     38669 8394181 66754  45.3638 1.728e-11 ***
## bedrooms              1     12006 8367518 66722  14.0850 0.0001757 ***
## accommodates          1     25717 8381228 66738  30.1688 4.059e-08 ***
## beds                  1      7987 8363498 66718   9.3694 0.0022123 ** 
## price                 1    195379 8550891 66936 229.2031 < 2.2e-16 ***
## minimum_nights        1     16295 8371807 66727  19.1163 1.243e-05 ***
## review_scores_rating  1     99585 8455096 66825 116.8248 < 2.2e-16 ***
## neighbourhood        65     96495 8452006 66694   1.7415 0.0002140 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
## [2] ""                                                              
## [3] "Residual standard error: 29.2 on 9800 degrees of freedom"      
## [4] "Multiple R-squared:  0.05924,\tAdjusted R-squared:  0.05214 "  
## [5] "F-statistic: 8.339 on 74 and 9800 DF,  p-value: < 2.2e-16"     
## [6] ""

bathrooms, bedrooms, accommodates, beds, price, minimum_nights and review_scores_rating still appear to have an effect on the response variable.

After checking collinearity, no GVIF value is above 5, so the model does not appear to have collinear variables (more details in the R script).
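As a rough illustration of the collinearity check, the sketch below computes plain variance inflation factors by hand on simulated data. It is not the GVIF computation from the R script: the data frame `sim`, its columns and the helper vector `vif_manual` are hypothetical, and only numeric predictors are covered.

```r
# Sketch under assumptions: simulated data, numeric predictors only
set.seed(1)
n <- 200
sim <- data.frame(
  bedrooms = rpois(n, 2) + 1,
  bathrooms = rpois(n, 1) + 1,
  price = rnorm(n, mean = 100, sd = 20)
)
# accommodates is deliberately built to correlate with bedrooms
sim$accommodates <- 2 * sim$bedrooms + rnorm(n, sd = 0.5)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on all the others
vif_manual <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

In this simulation the correlated pair (bedrooms, accommodates) ends up well above the threshold of 5, while price stays close to 1.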

Evaluation of model lm_occupancy.2: the model has low explanatory power according to the following metrics. RSE: 29.73. Adjusted R-squared: 0.053. R-squared: 0.06113, so only around 6% of the variability in occupancy rates is explained by the model.

5.5.12 Comparison of Observed, Fitted, and Predicted Values for the lm_occupancy.2 Model

The following analysis demonstrates the application of a linear model (lm_occupancy.2) to predict occupancy rates for accommodation listings using a simulated dataset. The first plot compares observed and fitted values to evaluate the model’s ability to capture the relationship between occupancy rate and key predictors, such as accommodates and price. The second plot visualizes residuals to assess model accuracy and highlight potential discrepancies.

## 
## Call:
## lm(formula = occupancy_rate_30 ~ bathrooms + bedrooms + accommodates + 
##     beds + price + minimum_nights + review_scores_rating + neighbourhood, 
##     data = dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.7506  -5.2368   0.1478   4.9251  20.8384 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          46.98124   10.36582   4.532 1.79e-05 ***
## bathrooms            -0.30551    1.74368  -0.175    0.861    
## bedrooms              0.95745    1.02696   0.932    0.354    
## accommodates          0.38669    0.31109   1.243    0.217    
## beds                  0.27583    1.21724   0.227    0.821    
## price                 0.02294    0.01853   1.238    0.219    
## minimum_nights       -0.11808    0.11173  -1.057    0.293    
## review_scores_rating  0.06048    0.09330   0.648    0.518    
## neighbourhoodGracia  -1.25489    2.27162  -0.552    0.582    
## neighbourhoodSants   -0.97870    2.18544  -0.448    0.655    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.852 on 90 degrees of freedom
## Multiple R-squared:  0.07826,    Adjusted R-squared:  -0.01391 
## F-statistic: 0.849 on 9 and 90 DF,  p-value: 0.5734

5.6 Generalised Linear Model with family set to Poisson

5.6.1 Introduction

The Poisson Generalised Linear Model extends the linear model to deal with a particular type of data: count data. It assumes that the response follows a Poisson distribution and uses the natural logarithm as the “link” function.

Therefore, for the Poisson GLM the response variable is availability_30.

Defining continuous and categorical variables: we continue with the variables used in the linear model and keep the categorical variables host_is_superhost, room_type and property_type converted to factors.

5.6.2 Fitting the model

Based on the data, the first graph shows availability by host_is_superhost, that is, how availability (availability_30) varies between superhosts and non-superhosts. The difference in medians is small, indicating that superhost status may not strongly influence availability. The range of availability is slightly larger for the “Non-Superhost” category, suggesting greater variability in this group.

The second graph illustrates availability by room type: it compares availability (availability_30) across the room types (entire home/apt, private room, shared room).

After fitting and analysing the Poisson model glm_basic, overdispersion is found, which is the rule rather than the exception when modelling count data. The overdispersion of this model, calculated as residual deviance / residual degrees of freedom, is around 9 (all code details in the R script). To address the overdispersion, a quasi-Poisson model is fitted later.
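The ratio can be reproduced as follows. The two numbers are the residual deviance and residual degrees of freedom printed for the quasi-Poisson fit later in this section; a Poisson fit with the same formula has the same deviance, but that glm_basic uses exactly that formula is an assumption. The helper `dispersion_check()` and the simulated data are hypothetical.

```r
# Values copied from the quasi-Poisson output shown later in this section
residual_deviance <- 92774
residual_df <- 9837
dispersion_ratio <- residual_deviance / residual_df
print(round(dispersion_ratio, 2))

# Hypothetical helper: the same ratio for any fitted glm object
dispersion_check <- function(model) deviance(model) / df.residual(model)

# Simulated counts whose variance is well above their mean
set.seed(42)
x <- runif(500)
y <- rnbinom(500, mu = exp(1 + x), size = 1)
fit_pois <- glm(y ~ x, family = poisson(link = "log"))
print(dispersion_check(fit_pois))  # a ratio clearly above 1 signals overdispersion
```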

However, before fitting the quasi-Poisson model, new observations are simulated from glm_basic, because the simulate() function does not directly support quasi-Poisson models.
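A minimal sketch of this step on simulated data (the data frame `sim_data` and the model names are hypothetical, not the ones from the R script):

```r
set.seed(123)
sim_data <- data.frame(bathrooms = sample(1:3, 300, replace = TRUE))
sim_data$availability_30 <- rpois(300, lambda = exp(1.5 + 0.2 * sim_data$bathrooms))

glm_sketch <- glm(availability_30 ~ bathrooms,
                  family = poisson(link = "log"), data = sim_data)

# simulate() draws one new response per observation from the fitted Poisson means
simulated <- simulate(glm_sketch, nsim = 1, seed = 1)
head(simulated)

# The quasipoisson family defines no simulation method, so the same call fails
glm_quasi_sketch <- update(glm_sketch, family = quasipoisson(link = "log"))
quasi_attempt <- try(simulate(glm_quasi_sketch, nsim = 1), silent = TRUE)
print(inherits(quasi_attempt, "try-error"))
```

The result of simulate() is a data frame with one column per simulation (sim_1 here), which matches the shape of the output printed below.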

## [1] 9875
##   sim_1
## 1     5
## 2     9
## 3     7
## 4    21
## 5     5
## 6    12
##      sim_1
## 9873     9
## 9874    17
## 9875     6
## 9876    10
## 9877    10
## 9878     7

Based on the Poisson simulation, the first graph shows simulated availability_30 by host_is_superhost: as in the observed data, the differences between the categories are small, and the range and variability are slightly greater for non-superhosts.

The second graph shows the simulated availability_30 for each room type. The simulated trends align well with the observed data: shared rooms have consistently lower availability, while entire apartments show greater variability. The similarity between the observed and simulated distributions suggests that the Poisson model captures the main patterns.

Fit the quasi-Poisson model. The output is shortened because property_type has many levels.

## 
## Call:
## glm(formula = availability_30 ~ price + host_listings_count + 
##     accommodates + minimum_nights + bathrooms + number_of_reviews_ltm + 
##     review_scores_rating + host_is_superhost + room_type + property_type + 
##     latitude + longitude, family = quasipoisson(link = "log"), 
##     data = BCN_Accomm)
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         -4.518e+01  3.034e+01  -1.489 0.136537    
## price                                1.216e-03  8.533e-05  14.250  < 2e-16 ***
## host_listings_count                 -7.649e-05  2.119e-04  -0.361 0.718158    
## accommodates                         9.066e-04  7.053e-03   0.129 0.897724    
## minimum_nights                       2.028e-03  3.524e-04   5.756 8.89e-09 ***
## bathrooms                            7.216e-02  1.752e-02   4.118 3.84e-05 ***
## number_of_reviews_ltm               -7.809e-03  7.956e-04  -9.815  < 2e-16 ***
## review_scores_rating                -1.535e-03  2.803e-04  -5.478 4.42e-08 ***
## host_is_superhostt                  -1.073e-01  3.123e-02  -3.436 0.000593 ***
## room_typePrivate room                2.767e-01  3.028e-02   9.137  < 2e-16 ***
## room_typeShared room                 5.013e-01  9.877e-02   5.075 3.94e-07 ***
## property_typeApartment               4.811e-01  2.866e-01   1.679 0.093196 .  
## property_typeBarn                    3.623e-01  6.164e-01   0.588 0.556705
## [1] "    Null deviance: 99642  on 9874  degrees of freedom"
## [2] "Residual deviance: 92774  on 9837  degrees of freedom"
## [3] "AIC: NA"                                              
## [4] ""                                                     
## [5] "Number of Fisher Scoring iterations: 7"               
## [6] ""

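host_listings_count and accommodates do not seem to have an effect on the response variable, and the same was concluded for latitude and longitude (their rows fall in the shortened part of the output). price is statistically significant in the output above, but its coefficient is very small (1.216e-03 on the log scale). The remaining variables seem to have an effect on the response variable. Several levels of the factor property_type do not seem to have an effect on availability_30.

Therefore, let us check whether property_type as a whole has an effect on the response variable using anova(): fit a model without this factor and compare.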

## Analysis of Deviance Table
## 
## Model 1: availability_30 ~ price + host_listings_count + accommodates + 
##     minimum_nights + bathrooms + number_of_reviews_ltm + review_scores_rating + 
##     host_is_superhost + room_type + property_type + latitude + 
##     longitude
## Model 2: availability_30 ~ price + host_listings_count + accommodates + 
##     minimum_nights + bathrooms + number_of_reviews_ltm + review_scores_rating + 
##     host_is_superhost + room_type + latitude + longitude
##   Resid. Df Resid. Dev  Df Deviance      F    Pr(>F)    
## 1      9837      92774                                  
## 2      9862      93275 -25  -501.41 2.2699 0.0002976 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 1 (glm_quasipoisson) explains the variability in availability_30 significantly better than Model 2 (p = 0.0003). The factor property_type therefore plays a relevant role and is not removed from the model.
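The comparison above follows the usual pattern for nested quasi-Poisson models. A minimal sketch on simulated data, where the data frame `d` and the models `full` and `reduced` are hypothetical names:

```r
set.seed(7)
n <- 400
d <- data.frame(
  bathrooms = sample(1:3, n, replace = TRUE),
  property_type = factor(sample(c("Apartment", "House", "Loft"), n, replace = TRUE))
)
lambda <- exp(1.5 + 0.2 * d$bathrooms + 0.4 * (d$property_type == "Loft"))
d$availability_30 <- rnbinom(n, mu = lambda, size = 2)  # overdispersed counts

full <- glm(availability_30 ~ bathrooms + property_type,
            family = quasipoisson(link = "log"), data = d)
reduced <- update(full, . ~ . - property_type)

# The F-test is used because the dispersion of a quasi family is estimated
cmp <- anova(full, reduced, test = "F")
print(cmp)
```

The Df column is negative in the second row because the reduced model is listed after the full one, as in the table above.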

As a next step, the model is refitted without the variables that might not have an impact on the response variable, and then analysed.

5.6.3 quasiPoisson glm_quasi_updated Model: Interpretation and coefficients

## 
## Call:
## glm(formula = availability_30 ~ minimum_nights + bathrooms + 
##     number_of_reviews_ltm + review_scores_rating + host_is_superhost + 
##     room_type + property_type, family = quasipoisson(link = "log"), 
##     data = BCN_Accomm)
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          1.580e+00  2.890e-01   5.469 4.64e-08 ***
## minimum_nights                       1.456e-03  4.181e-04   3.483 0.000497 ***
## bathrooms                            1.135e-01  1.477e-02   7.689 1.63e-14 ***
## number_of_reviews_ltm               -8.098e-03  7.985e-04 -10.141  < 2e-16 ***
## review_scores_rating                -1.533e-03  2.820e-04  -5.435 5.62e-08 ***
## host_is_superhostt                  -9.516e-02  3.116e-02  -3.054 0.002263 ** 
## room_typePrivate room                1.655e-01  2.222e-02   7.446 1.04e-13 ***
## room_typeShared room                 3.306e-01  9.925e-02   3.331 0.000869 ***
## property_typeApartment               4.661e-01  2.886e-01   1.615 0.106280    
## property_typeBarn                    2.364e-01  6.199e-01   0.381 0.702889    
## property_typeBed and breakfast       4.079e-01  3.063e-01   1.332 0.182960
## 
## ...
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 8.966836)
## 
##     Null deviance: 99642  on 9874  degrees of freedom
## Residual deviance: 94417  on 9842  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 7

Some relevant interpretations, using the coefficients displayed above:

  - Intercept (1.580): when all numeric predictors are 0 and the categorical predictors are at their reference levels, the expected availability (availability_30) is approximately 4.85 days [e^1.580].
  - minimum_nights (1.456×10^-3): for every additional night in the minimum_nights requirement, the expected availability increases by approximately 0.15% [(e^0.001456 - 1) x 100].
  - bathrooms (1.135×10^-1): for every additional bathroom, the expected availability increases by approximately 12.02% [(e^0.1135 - 1) x 100].
  - number_of_reviews_ltm (−8.098×10^-3): for every additional review in the last twelve months, the expected availability decreases by approximately 0.81% [(e^-0.008098 - 1) x 100].
  - host_is_superhost (−9.516×10^-2): superhosts have 9.08% lower availability compared to non-superhosts (reference level).
  - room_typePrivate room (1.655×10^-1): private rooms have 18.00% higher availability compared to the reference level (Entire home/apt).
  - room_typeShared room (3.306×10^-1): shared rooms have 39.18% higher availability compared to the reference level.
  - property_typeBoat (9.469×10^-1): boats have 158.69% higher availability compared to the reference level property type.
  - property_typeBoutique hotel: boutique hotels have 117.66% higher availability compared to the reference level property type.
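With a log link, a coefficient b corresponds to a percentage change of (e^b - 1) x 100 in the expected count per unit of the predictor. The sketch below applies this to the coefficients displayed in the glm_quasi_updated summary above; the vector `coefs` is typed in by hand and is not extracted from the fitted object.

```r
# Coefficients copied from the glm_quasi_updated summary (log scale)
coefs <- c(
  minimum_nights = 1.456e-03,
  bathrooms = 1.135e-01,
  number_of_reviews_ltm = -8.098e-03,
  host_is_superhostt = -9.516e-02,
  `room_typePrivate room` = 1.655e-01,
  `room_typeShared room` = 3.306e-01
)

# Percentage change in expected availability_30 per unit increase
pct_change <- (exp(coefs) - 1) * 100
print(round(pct_change, 2))
```

With the fitted object at hand, the same table would come from `(exp(coef(glm_quasi_updated)) - 1) * 100`.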

5.6.4 Model Metrics

## Null Deviance: 99642.31
## Residual Deviance: 92774.07
## Deviance Explained: 6.892891 %

Interpretation:

  - Dispersion parameter: for the quasi-Poisson family, this measures the overdispersion in the data. It is estimated at about 9 (8.97 in the glm_quasi_updated summary).
  - Null deviance: the deviance (a measure of goodness of fit) of a model with only the intercept and no predictors (99642.31).
  - Residual deviance: the deviance of the fitted model including the predictors (92774.07).
  - Deviance explained: how much of the variation in the response variable is explained by the model (6.89%).
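The deviance explained follows directly from the two deviances printed above. A short check, with a hypothetical helper `deviance_explained_pct()` for fitted glm objects:

```r
null_deviance <- 99642.31
residual_deviance <- 92774.07

# Share of the null deviance removed by adding the predictors, in percent
deviance_explained <- (1 - residual_deviance / null_deviance) * 100
print(round(deviance_explained, 2))

# Hypothetical helper doing the same on a fitted glm object
deviance_explained_pct <- function(model) {
  (1 - deviance(model) / model$null.deviance) * 100
}
```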

5.6.5 General linear hypothesis test (glth)

The factor room_type has three levels:

## Entire home/apt
## Private room
## Shared room

Hypothesis 1: private accommodation (Entire home/apt and Private room, taken together) differs from shared accommodation (Shared room) in its effect on availability_30 (quasi-Poisson model glm_quasi_updated).

Hypothesis 2: Private room differs from Shared room in its effect on availability_30 (quasi-Poisson model glm_quasi_updated).

Contrast matrix for the two hypotheses:

##                             Entire home/apt Private room Shared room
## privacy vs shared                       0.5          0.5          -1
## private room vs shared room             0.0          1.0          -1
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: glm(formula = availability_30 ~ minimum_nights + bathrooms + 
##     number_of_reviews_ltm + review_scores_rating + host_is_superhost + 
##     room_type + property_type, family = quasipoisson(link = "log"), 
##     data = BCN_Accomm)
## 
## Linear Hypotheses:
##                                  Estimate Std. Error z value Pr(>|z|)  
## privacy vs shared == 0           -0.24786    0.09803  -2.528   0.0129 *
## private room vs shared room == 0 -0.16512    0.09805  -1.684   0.1009  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

There is a significant difference between private accommodation (entire home/apt and private room) and shared accommodation (shared rooms) regarding availability_30 (adjusted p = 0.0129). Private accommodations have significantly lower availability than shared accommodations.

The difference between private room and shared room is not significant at the 5% level (adjusted p = 0.1009). The estimate is negative (−0.165), which points towards lower availability for private rooms than for shared rooms, but this difference cannot be distinguished from zero.
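The two estimates in the test output can be recovered from the room_type coefficients of glm_quasi_updated, since each contrast is a weighted sum of the level effects. In this sketch the reference level Entire home/apt is given an effect of 0, and the values are typed in from the summary rather than read from the model object:

```r
# Level effects on the log scale; the reference level has effect 0
room_effects <- c(`Entire home/apt` = 0, `Private room` = 0.1655, `Shared room` = 0.3306)

contrasts_K <- rbind(
  "privacy vs shared" = c(0.5, 0.5, -1),
  "private room vs shared room" = c(0, 1, -1)
)

# Each estimate is the contrast row multiplied by the vector of level effects
estimates <- drop(contrasts_K %*% room_effects)
print(round(estimates, 4))

# On the response scale a contrast becomes a ratio of expected availabilities
print(round(exp(estimates), 3))
```

The standard errors and adjusted p-values additionally need the coefficient covariance matrix, which is why the full test is run on the fitted model.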

5.6.6 Effect of number_of_reviews_ltm on Availability

How an increase in number_of_reviews_ltm affects availability_30 according to the quasi-Poisson model:

## Effect of 10 additional reviews:  0.9222142
## Percentage change in availability:  -7.778583 %
## Effect of 50 additional reviews:  0.6670509
## Percentage change in availability:  -33.29491 %

This reflects that listings with many reviews are more popular and tend to be booked more frequently.
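These multipliers come from exponentiating the coefficient times the number of additional reviews. A sketch using the number_of_reviews_ltm coefficient from the glm_quasi_updated summary (the function name `review_effect` is hypothetical):

```r
beta_reviews <- -8.098e-03  # copied from the glm_quasi_updated summary

# Multiplicative change in expected availability_30 for n additional reviews
review_effect <- function(n_reviews, beta = beta_reviews) exp(beta * n_reviews)

for (n in c(10, 50)) {
  cat("Effect of", n, "additional reviews:", round(review_effect(n), 4),
      "| percentage change:", round((review_effect(n) - 1) * 100, 2), "%\n")
}
```

Because the model is multiplicative, the effect of 50 reviews is the effect of 10 reviews raised to the fifth power, not five times the percentage change.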

5.6.7 Predictions model glm_quasi_updated

5.6.7.1 Predict Availability for New Data

The aim is to predict the response variable (available days) for three specific accommodations using the predict() function:

  - Accommodation 1: Minimum Nights: 3, Bathrooms: 1, Reviews: 20, Rating: 90, Superhost: No, Room Type: Private Room, Property Type: Apartment
  - Accommodation 2: Minimum Nights: 5, Bathrooms: 2, Reviews: 50, Rating: 95, Superhost: No, Room Type: Entire Home/Apt, Property Type: House
  - Accommodation 3: Minimum Nights: 7, Bathrooms: 3, Reviews: 100, Rating: 85, Superhost: Yes, Room Type: Shared Room, Property Type: Boat

##  1  2  3 
##  8  5 11

Interpretation of predictions:

  - Accommodation 1: for the first hypothetical listing, the predicted availability_30 is approximately 8 days.
  - Accommodation 2: for the second listing, the predicted availability_30 is approximately 5 days.
  - Accommodation 3: for the third listing, the predicted availability_30 is approximately 11 days.
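The prediction step can be sketched as follows on simulated data with a reduced set of predictors. The objects `train`, `fit` and `new_listings` are hypothetical; the point is that factor columns in the new data must use the levels seen during fitting, and that type = "response" returns days rather than log-days.

```r
set.seed(99)
n <- 500
room_levels <- c("Entire home/apt", "Private room", "Shared room")
train <- data.frame(
  minimum_nights = sample(1:10, n, replace = TRUE),
  bathrooms = sample(1:3, n, replace = TRUE),
  room_type = factor(sample(room_levels, n, replace = TRUE), levels = room_levels)
)
train$availability_30 <- pmin(rpois(n, exp(1.8 + 0.1 * train$bathrooms)), 30)

fit <- glm(availability_30 ~ minimum_nights + bathrooms + room_type,
           family = quasipoisson(link = "log"), data = train)

new_listings <- data.frame(
  minimum_nights = c(3, 5, 7),
  bathrooms = c(1, 2, 3),
  room_type = factor(c("Private room", "Entire home/apt", "Shared room"),
                     levels = room_levels)
)

pred_days <- predict(fit, newdata = new_listings, type = "response")  # expected days
pred_link <- predict(fit, newdata = new_listings)                     # log scale (default)
print(round(pred_days))
```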

5.6.7.2 Predict Availability for the Original Dataset

The following are the predicted 30-day availability values for the first six listings in the original dataset (BCN_Accomm), based on the fitted quasi-Poisson model (glm_quasi_updated):

# Predict availability for all listings in the original dataset
predicted_availability <- round(predict(glm_quasi_updated, type = "response"), digits=0)
head(predicted_availability)  # View the first few predictions
##  1  2  3  4  5  6 
##  8  8  7 13  9  9

Interpretation: Listing 1: predicted availability is 8 days. Listing 2: predicted availability is 8 days. Listing 3: predicted availability is 7 days. Listing 4: predicted availability is 13 days. Listing 5: predicted availability is 9 days. Listing 6: predicted availability is 9 days. These predictions are on the response scale (availability_30) and represent the expected availability based on the predictors in the original dataset.

summary(predicted_availability)

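Comparison of the actual values of availability_30 with the predicted values from the model: in the table below, the predicted values are higher than the actual values for listings with low availability (listings 4 to 6) and lower than the actual values for listings with high availability (listings 2 and 3).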

##   Actual Predicted
## 1      9         8
## 2     18         8
## 3     21         7
## 4      3        13
## 5      2         9
## 6      0         9

Visualization, histogram of predicted availability: the following graph shows that most predictions are concentrated in the range of 5-10 days.

Evaluation of the prediction a) Residual analysis

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -2.10e+01 -7.00e+00 -2.00e+00 -5.06e-04  5.00e+00  2.40e+01

Positive Residuals: Underprediction (model predicts lower availability than actual). Negative Residuals: Overprediction (model predicts higher availability than actual).

Visualization Scatter Plot: Actual vs. Predicted

Points above the red line indicate underprediction. Points below the red line indicate overprediction.

Model Performance Metrics: a) Mean Absolute Error (MAE): the average absolute difference between predicted and actual values:

## Mean Absolute Error (MAE): 7

A lower MAE indicates better accuracy. On average, the predictions are 7 days off from the actual availability.

b) Root Mean Squared Error (RMSE): a measure that penalizes large errors more heavily:
## Root Mean Squared Error (RMSE): 9

At 9 days, the RMSE is slightly higher than the MAE, showing that some predictions have large errors.
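Both metrics are simple functions of the residuals (actual minus predicted). The sketch below computes them for the six listings in the comparison table above only, so the results differ from the full-dataset values of 7 and 9:

```r
# First six listings from the Actual / Predicted table above
actual <- c(9, 18, 21, 3, 2, 0)
predicted <- c(8, 8, 7, 13, 9, 9)

# Positive residual = underprediction, negative residual = overprediction
residuals_pred <- actual - predicted

mae <- mean(abs(residuals_pred))       # average size of the error
rmse <- sqrt(mean(residuals_pred^2))   # squaring gives large errors more weight
print(c(MAE = mae, RMSE = rmse))
```

The RMSE can never be smaller than the MAE, and the gap between the two grows with the spread of the individual errors.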

5.7 Generalised Linear Model for Multinomial data

To explore predictions for price and occupancy rate, we can fit GLMs for classification, treating the two response variables as multinomial. This approach requires classifying their values into categories.

5.7.1 Multinomial GLM for price prediction

We want to further explore the key factors influencing accommodation prices in Barcelona. In particular, we would like to analyse which price can be predicted from the variables listed above.

To fit a multinomial model, the prices must be converted into ranges. In this case we divide them into five ordered categories:

  - ‘very low’: between 0 and 50 euros,
  - ‘low’: between 50 and 150 euros,
  - ‘medium’: between 150 and 300 euros,
  - ‘high’: between 300 and 500 euros,
  - ‘very high’: between 500 and 1000 euros.

train_normalized_encod$price_cat <- cut(train_normalized$price, 
                       breaks = c(0, 0.05, 0.15, 0.30, 0.5, 1), 
                       labels = c("very low", "low", "medium", "high", "very high"))
table(train_normalized_encod$price_cat) # check the distribution
## 
##  very low       low    medium      high very high 
##      2599      2552       546       181        64
test_normalized_encod$price_cat <- cut(test_normalized$price, 
                       breaks = c(0, 0.05, 0.15, 0.30, 0.5, 1), 
                       labels = c("very low", "low", "medium", "high", "very high"))
table(test_normalized_encod$price_cat) # check the distribution
## 
##  very low       low    medium      high very high 
##      1757      1621       380       123        51
# check if categories are right skewed - if yes, apply log transformation before fitting the model
hist(train_normalized_encod$price,
     main = "Histogram",
     xlab = "variable",
     col = "lightblue",
     border = "black",
     breaks = 30)

Now we want to fit a model to test what is the probability that a property belongs to a specific price category according to the other variables. Since the dependent variable is ordinal (with an inherent order) we use family ‘cumulative’ with link ‘logit’.

# multinomial logistic regression model for ordered categories
vglm_model <- vglm(price_cat ~ bathrooms + bedrooms + accommodates + beds + 
                     latitude + longitude + review_scores_rating + minimum_nights + room_type + neighbourhood, 
                   family = cumulative(link = "logit", parallel = TRUE), 
                   data = train_normalized_encod)
summary(vglm_model)
## Call:
## vglm(formula = price_cat ~ bathrooms + bedrooms + accommodates + 
##     beds + latitude + longitude + review_scores_rating + minimum_nights + 
##     room_type + neighbourhood, family = cumulative(link = "logit", 
##     parallel = TRUE), data = train_normalized_encod)
## 
## Coefficients: 
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1         -0.80354    0.16245  -4.946 7.56e-07 ***
## (Intercept):2          2.27531    0.16501  13.789  < 2e-16 ***
## (Intercept):3          3.75213    0.17737  21.154  < 2e-16 ***
## (Intercept):4          5.22323    0.20723  25.206  < 2e-16 ***
## bathrooms             -2.35937    0.52482  -4.496 6.94e-06 ***
## bedrooms              -0.35066    0.58314  -0.601 0.547627    
## accommodates          -5.66726    0.43034 -13.169  < 2e-16 ***
## beds                   2.12288    1.09478   1.939 0.052490 .  
## latitude               1.08986    0.22608   4.821 1.43e-06 ***
## longitude             -0.79633    0.21647  -3.679 0.000234 ***
## review_scores_rating   0.39166    0.07238   5.411 6.26e-08 ***
## minimum_nights        30.06395    2.29519  13.099  < 2e-16 ***
## room_typePrivate room  1.97620    0.08102  24.393  < 2e-16 ***
## room_typeShared room   3.27239    0.40169   8.147 3.74e-16 ***
## neighbourhood         -0.44121    1.20100  -0.367 0.713344    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]), 
## logitlink(P[Y<=3]), logitlink(P[Y<=4])
## 
## Residual deviance: 9513.102 on 23753 degrees of freedom
## 
## Log-likelihood: NA on 23753 degrees of freedom
## 
## Number of Fisher scoring iterations: 8 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## 
## Exponentiated coefficients:
##             bathrooms              bedrooms          accommodates 
##          9.447993e-02          7.042257e-01          3.457331e-03 
##                  beds              latitude             longitude 
##          8.355133e+00          2.973861e+00          4.509801e-01 
##  review_scores_rating        minimum_nights room_typePrivate room 
##          1.479430e+00          1.139221e+13          7.215248e+00 
##  room_typeShared room         neighbourhood 
##          2.637434e+01          6.432577e-01
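To read the signs in this output, note that the linear predictors are logit P(Y <= j), as listed under "Names of linear predictors". A positive coefficient therefore raises the probability of the lower price categories, and a negative one shifts listings towards higher categories. The sketch below turns the four intercepts into category probabilities for a listing with all predictors at 0 and factors at their reference levels, then adds the Private room coefficient; the function `category_probs()` is a hypothetical helper.

```r
# Intercepts copied from the vglm summary above
intercepts <- c(-0.80354, 2.27531, 3.75213, 5.22323)
price_labels <- c("very low", "low", "medium", "high", "very high")

category_probs <- function(eta_shift = 0) {
  cum <- plogis(intercepts + eta_shift)  # P(Y <= j) for j = 1..4
  probs <- diff(c(0, cum, 1))            # P(Y = j) for j = 1..5
  setNames(probs, price_labels)
}

p_reference <- category_probs(0)       # baseline listing
p_private <- category_probs(1.97620)   # same listing as a Private room
print(round(rbind(reference = p_reference, private_room = p_private), 3))
```

Under this reading, private and shared rooms (positive coefficients) are more likely to fall in the cheap categories, while more accommodates and bathrooms (negative coefficients) push listings towards the expensive ones.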

We want to get the predicted probabilities for each price category and evaluate how many predictions are correct or incorrect in comparison with the test set.

# to get the predicted probabilities for each category
# it gives an n*K matrix, where n is the number of rows in the test set and K the number of categories in the ordinal variable
predicted_probs <- predict(vglm_model, newdata = test_normalized_encod, type = "response")
# convert those probabilities into the most likely predicted category by selecting the category with the highest probability for each observation
# the result is a vector of predicted category indices (1 to 5), one for each row of the test set
predicted_categories <- apply(predicted_probs, 1, which.max)
# compare predicted categories with actual
# Create a confusion matrix to show how many observations in the test set were correctly/incorrectly classified in each category of price range
confusion_matrix <- table(Predicted = predicted_categories, Actual = test_normalized_encod$price_cat)
print(confusion_matrix)
##          Actual
## Predicted very low  low medium high very high
##         1     1517  502     33   10         9
##         2      237 1105    305   81        23
##         3        3   12     25   17        13
##         4        0    1     11    4         4
##         5        0    1      6   11         2

According to the confusion matrix, most correct predictions fall in the ‘very low’ and ‘low’ price categories, while the results for the other categories are poor.

To evaluate the performance of the model, we make the following considerations on metrics.

  - RMSE (Root Mean Squared Error) is typically used for regression tasks rather than classification tasks. Since we are fitting a classification model, RMSE is not the most appropriate metric to assess its performance.
  - MAE (Mean Absolute Error) operates on the difference between continuous actual and predicted values. Since the model predicts a categorical outcome (the levels of price_cat), MAE is not a suitable metric for this type of model.
  - R-squared (R²) is primarily used to evaluate regression models, not classification models. It indicates the proportion of variance in the dependent variable that is predictable from the independent variables.

Instead, we consider precision and recall.

  - Precision and recall: for each class, precision measures how often a prediction of that class is correct, and recall measures how many of the true instances of that class the model identifies.

# Precision and recall
# Initialize vectors to store precision and recall for each class
precision <- numeric(length = ncol(confusion_matrix))
recall <- numeric(length = ncol(confusion_matrix))

# Loop over each class to calculate precision and recall
for (k in 1:ncol(confusion_matrix)) {
  TP <- confusion_matrix[k, k]  # True Positives
  FP <- sum(confusion_matrix[k, ]) - TP  # False Positives
  FN <- sum(confusion_matrix[, k]) - TP  # False Negatives
  # TN (True Negatives) are not used directly here
  
  # Calculate precision and recall for class k
  precision[k] <- TP / (TP + FP)
  recall[k] <- TP / (TP + FN)
}

# Print Precision and Recall per class
print("Precision per class:")
## [1] "Precision per class:"
print(precision)
## [1] 0.7324964 0.6310680 0.3571429 0.2000000 0.1000000
print("Recall per class:")
## [1] "Recall per class:"
print(recall)
## [1] 0.86340353 0.68167798 0.06578947 0.03252033 0.03921569
# Calculate average Precision and Recall
average_precision <- mean(precision, na.rm = TRUE)
average_recall <- mean(recall, na.rm = TRUE)

print(paste("Average Precision:", average_precision))
## [1] "Average Precision: 0.404141439373797"
print(paste("Average Recall:", average_recall))
## [1] "Average Recall: 0.336521398092365"

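Averaged over the five price categories, precision is about 40% and recall about 34%. In other words, on average across the classes, about 40% of the listings assigned to a category truly belong to it, and only about 34% of the listings in a category are recovered. The per-class values show that these averages hide a strong imbalance: the model performs acceptably for ‘very low’ and ‘low’, but recall is below 7% for ‘medium’, ‘high’ and ‘very high’. In conclusion, the classification model does not perform well.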
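As a cross-check, the per-class metrics can also be computed without a loop. The sketch below re-enters the confusion matrix printed above by hand (the object name `cm` and the class labels on the rows are assumptions made for readability) and uses the fact that precision is the diagonal divided by the row sums and recall is the diagonal divided by the column sums.

```r
# Confusion matrix copied from the printed output (rows = predicted, columns = actual)
classes <- c("very low", "low", "medium", "high", "very high")
cm <- matrix(
  c(1517,  502,  33, 10,  9,
     237, 1105, 305, 81, 23,
       3,   12,  25, 17, 13,
       0,    1,  11,  4,  4,
       0,    1,   6, 11,  2),
  nrow = 5, byrow = TRUE,
  dimnames = list(Predicted = classes, Actual = classes)
)

precision <- diag(cm) / rowSums(cm)   # TP / (TP + FP), one value per class
recall    <- diag(cm) / colSums(cm)   # TP / (TP + FN), one value per class
accuracy  <- sum(diag(cm)) / sum(cm)  # share of all test listings classified correctly

round(rbind(precision, recall), 3)
accuracy
```

The same matrix also gives the overall accuracy (2653 correct out of 3932, about 0.67), which is much higher than the macro-averaged precision and recall because the two largest classes dominate the test set.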

5.7.2 Multinomial GLM for occupancy rate prediction

This model addresses Research Question 2: can we predict occupancy rates based on location, amenities, or other factors? To fit a multinomial model, the occupancy rate first has to be converted into ranges. Because the values are unevenly distributed, we prefer a data-driven approach with quantile-based bins, which divide the data into categories with approximately the same number of observations. The occupancy rate runs from 0 to 100, and we split it into the three categories ‘low’, ‘medium’ and ‘high’.

train_normalized_encod_glm <- train_normalized_encod
test_normalized_encod_glm <- test_normalized_encod

quantiles_train <- quantile(train_normalized_encod_glm$occupancy_rate_30, probs = c(0, 1/3, 2/3, 1))
quantiles_train
##        0% 33.33333% 66.66667%      100% 
##   0.00000  70.00000  93.33333 100.00000
train_normalized_encod_glm$occupancy_rate_30 <- cut(train_normalized_encod_glm$occupancy_rate_30,
                                breaks = quantiles_train, 
                                labels = c("low", "medium", "high"), 
                                include.lowest = TRUE)

table(train_normalized_encod_glm$occupancy_rate_30) # check the distribution
quantiles_test <- quantile(test_normalized_encod_glm$occupancy_rate_30, probs = c(0, 1/3, 2/3, 1))

test_normalized_encod_glm$occupancy_rate_30 <- cut(test_normalized_encod_glm$occupancy_rate_30, 
                                breaks = quantiles_test, 
                                labels = c("low", "medium", "high"), 
                                include.lowest = TRUE)

table(test_normalized_encod_glm$occupancy_rate_30) # check the distribution
## 
##    low medium   high 
##   1433   1233   1266
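In the code above the test set is binned with its own quantiles, so ‘low’, ‘medium’ and ‘high’ may not mean exactly the same ranges in the two sets. One possible alternative, sketched here on simulated data, is to estimate the cut points on the training set only and reuse them for the test set. The vectors `train_occ` and `test_occ` and the helper `bin_occupancy()` are hypothetical stand-ins, not objects from the report.

```r
set.seed(42)
# Simulated occupancy rates in [0, 100]; stand-ins for the real occupancy_rate_30 columns
train_occ <- round(100 * rbeta(600, 2, 1.2), 1)
test_occ  <- round(100 * rbeta(200, 2, 1.2), 1)

# Tercile cut points estimated on the training set only
breaks_train <- quantile(train_occ, probs = c(0, 1/3, 2/3, 1))
# Occupancy is bounded, so widen the outer breaks to the full 0-100 range;
# otherwise test values outside the training range would become NA
breaks_train[c(1, 4)] <- c(0, 100)

# Hypothetical helper: apply one fixed set of breaks to any occupancy vector
bin_occupancy <- function(x, breaks) {
  cut(x, breaks = breaks, labels = c("low", "medium", "high"), include.lowest = TRUE)
}

train_cat <- bin_occupancy(train_occ, breaks_train)
test_cat  <- bin_occupancy(test_occ, breaks_train)

table(train_cat)
table(test_cat)
```

With shared breaks the test categories are no longer guaranteed to be balanced, but a predicted ‘high’ refers to the same occupancy range in both sets.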

We now fit a model for the probability that a property falls into each occupancy category over the next 30 days, given the other variables. Since the dependent variable is ordinal (its categories have an inherent order), we use the ‘cumulative’ family with a ‘logit’ link.

vglm_occupancy <- vglm(occupancy_rate_30 ~ bathrooms + bedrooms + accommodates + beds + latitude + longitude + review_scores_rating + minimum_nights + price + neighbourhood,
                   family = cumulative(link = "logit", parallel = TRUE), 
                   data = train_normalized_encod_glm)
summary(vglm_occupancy)
## Call:
## vglm(formula = occupancy_rate_30 ~ bathrooms + bedrooms + accommodates + 
##     beds + latitude + longitude + review_scores_rating + minimum_nights + 
##     price + neighbourhood, family = cumulative(link = "logit", 
##     parallel = TRUE), data = train_normalized_encod_glm)
## 
## Coefficients: 
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1        -0.93537    0.13361  -7.001 2.55e-12 ***
## (Intercept):2         0.42327    0.13321   3.178  0.00149 ** 
## bathrooms             1.88802    0.47064   4.012 6.03e-05 ***
## bedrooms             -1.33954    0.53547  -2.502  0.01236 *  
## accommodates         -0.20559    0.35886  -0.573  0.56670    
## beds                  1.00582    1.04866   0.959  0.33748    
## latitude             -0.02823    0.19007  -0.149  0.88191    
## longitude             0.03205    0.18750   0.171  0.86429    
## review_scores_rating -0.04403    0.06239  -0.706  0.48036    
## minimum_nights       -0.42583    1.12780  -0.378  0.70574    
## price                 2.97850    0.31477   9.463  < 2e-16 ***
## neighbourhood        -0.70877    1.09426  -0.648  0.51717    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2])
## 
## Residual deviance: 12872.33 on 11874 degrees of freedom
## 
## Log-likelihood: -6436.165 on 11874 degrees of freedom
## 
## Number of Fisher scoring iterations: 5 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## 
## Exponentiated coefficients:
##            bathrooms             bedrooms         accommodates 
##            6.6063076            0.2619653            0.8141639 
##                 beds             latitude            longitude 
##            2.7341412            0.9721609            1.0325659 
## review_scores_rating       minimum_nights                price 
##            0.9569219            0.6532252           19.6582870 
##        neighbourhood 
##            0.4922482

The summary of the model provides two intercepts, which are the thresholds on the logit scale between ‘low’ and ‘medium/high’ and between ‘low/medium’ and ‘high’. As we used ‘logit’ as the link, we consider the exponentiated coefficients, which are the odds ratios for a one-unit increase in the predictor with all other variables held constant. The linear predictors are logit(P[Y<=1]) and logit(P[Y<=2]), so each odds ratio refers to the odds of falling in a lower (or equal) occupancy category: a value above 1 points towards lower occupancy and a value below 1 towards higher occupancy. For example, a one-unit increase in bedrooms multiplies the odds of a lower category by 0.26, so more bedrooms are associated with higher occupancy, whereas the odds ratios for bathrooms (6.6) and price (19.7) point towards lower occupancy. The odds ratio for beds (2.7) is not statistically significant (p = 0.34), so the model does not support a claim about visitors preferring smaller or larger properties. The model is fitted on train_normalized_encod, so the predictors appear to be normalized and a one-unit increase in price is not one additional euro but a move across the scaled range of the variable. Read this way, the odds ratio of 19.7 makes price the most significant predictor in the model (p < 2e-16), with higher prices associated with lower occupancy categories.
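The odds ratios can be reproduced from the printed coefficients with a few lines of R. In this sketch the coefficients are typed in by hand from the summary, and the reference point (all predictors at zero) is an arbitrary assumption chosen only to illustrate the direction of the effect. Because the summary names the linear predictors logitlink(P[Y<=1]) and logitlink(P[Y<=2]), the odds here are the odds of being at or below a category.

```r
# Coefficients copied from the vglm summary above (log-odds scale)
beta <- c(bathrooms = 1.88802, bedrooms = -1.33954, beds = 1.00582, price = 2.97850)

# Odds ratios for a one-unit increase in each (scaled) predictor
odds_ratio <- exp(beta)
round(odds_ratio, 2)

# Direction check with the first threshold, P(Y <= "low"):
# reference listing with all predictors at zero, then +1 unit of price
eta_ref <- -0.93537                           # (Intercept):1 from the summary
p_ref   <- plogis(eta_ref)                    # P(low) at the reference point
p_plus  <- plogis(eta_ref + beta[["price"]])  # P(low) after +1 unit of price

c(reference = p_ref, plus_one_price = p_plus)
```

A positive coefficient raises P(Y <= "low"), so in this parameterization it pushes listings towards the lower occupancy categories.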

We want to get the predicted probabilities for each occupancy rate category and evaluate how many predictions from the model are correct or incorrect in comparison with the test set.

# to get the predicted probabilities for each category
predicted_probs <- predict(vglm_occupancy, newdata = test_normalized_encod_glm, type = "response")
# Get the predicted categories by selecting the class with the highest probability
predicted_categories <- apply(predicted_probs, 1, function(x) {
  c("low", "medium", "high")[which.max(x)]
})
table(predicted_categories)
## predicted_categories
## high  low 
## 2030 1902

The table of predicted categories shows that the model never assigns the class ‘medium’: every test observation is classified as either ‘low’ or ‘high’.

head(predicted_probs)
##          low     medium       high
## 2  0.2946401 0.32444848 0.38091140
## 4  0.8984865 0.07329506 0.02821846
## 7  0.3090546 0.32602989 0.36491554
## 12 0.2625250 0.31819973 0.41927523
## 14 0.3526358 0.32679599 0.32056817
## 15 0.3250433 0.32698003 0.34797671
predicted_probabilities_class2 <- predicted_probs[, 2]
summary(predicted_probabilities_class2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07129 0.32497 0.32638 0.32025 0.32695 0.32717

Inspecting the predicted probabilities, the values for the intermediate category cluster tightly around 0.32 to 0.33 (median 0.326, maximum 0.327). Since one of the two outer categories always receives a slightly higher probability, ‘medium’ is never the most likely class, and the model effectively predicts only ‘low’ or ‘high’.
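The ceiling on the ‘medium’ probability follows from the two fitted thresholds. With parallel = TRUE, the predictors shift both thresholds by the same amount, so the probability of the middle category is limited by the gap between the intercepts. The sketch below uses the intercepts printed in the summary; the grid `b` of possible shifts is an assumption used only to scan the range of linear predictors.

```r
# Thresholds (intercepts) copied from the vglm summary
a1 <- -0.93537   # logit P(Y <= low)
a2 <-  0.42327   # logit P(Y <= medium)

# Under the parallel assumption every listing adds the same shift b to both thresholds
b <- seq(-10, 10, by = 0.01)

p_low    <- plogis(a1 + b)
p_medium <- plogis(a2 + b) - plogis(a1 + b)
p_high   <- 1 - plogis(a2 + b)

# Largest probability the middle category can reach for any shift
max_medium <- max(p_medium)
max_medium

# Index of the most probable class for each shift (1 = low, 2 = medium, 3 = high)
winner <- max.col(cbind(p_low, p_medium, p_high), ties.method = "first")
any(winner == 2)
```

The maximum is about 0.327, which agrees with the largest value in the summary of predicted probabilities, and at that point ‘low’ and ‘high’ each still have about 0.336. Taking the class with the highest probability can therefore never return ‘medium’ with these thresholds.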

# Ensure the levels are the same
# Check the levels of the true labels and predicted categories
predicted_categories <- factor(predicted_categories, levels = c("low", "medium", "high"))
actual_categories <- factor(test_normalized_encod_glm$occupancy_rate_30, levels = c("low", "medium", "high"))
conf_matrix <- confusionMatrix(predicted_categories, actual_categories)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction low medium high
##     low    783    642  477
##     medium   0      0    0
##     high   650    591  789
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3998          
##                  95% CI : (0.3844, 0.4153)
##     No Information Rate : 0.3644          
##     P-Value [Acc > NIR] : 2.537e-06       
##                                           
##                   Kappa : 0.0871          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: low Class: medium Class: high
## Sensitivity              0.5464        0.0000      0.6232
## Specificity              0.5522        1.0000      0.5345
## Pos Pred Value           0.4117           NaN      0.3887
## Neg Pred Value           0.6798        0.6864      0.7492
## Prevalence               0.3644        0.3136      0.3220
## Detection Rate           0.1991        0.0000      0.2007
## Detection Prevalence     0.4837        0.0000      0.5163
## Balanced Accuracy        0.5493        0.5000      0.5789

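The accuracy of the model is around 40%, only slightly above the no-information rate of 36.4% that would be obtained by always predicting the most frequent class, and the kappa of 0.09 indicates very little agreement beyond chance. This is not a satisfactory result. To evaluate the performance of the model further, we consider precision and recall.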

# Precision and recall
table_conf_matrix <- conf_matrix$table
# Initialize vectors to store precision and recall for each class
precision <- numeric(length = ncol(table_conf_matrix))
recall <- numeric(length = ncol(table_conf_matrix))

# Loop over each class to calculate precision and recall
for (k in 1:ncol(table_conf_matrix)) {
  TP <- table_conf_matrix[k, k]  # True Positives
  FP <- sum(table_conf_matrix[k, ]) - TP  # False Positives
  FN <- sum(table_conf_matrix[, k]) - TP  # False Negatives
  # TN (True Negatives) are not used directly here
  
  # Calculate precision and recall for class k
  precision[k] <- TP / (TP + FP)
  recall[k] <- TP / (TP + FN)
}

# Print Precision and Recall per class
print("Precision per class:")
## [1] "Precision per class:"
print(precision)
## [1] 0.4116719       NaN 0.3886700
print("Recall per class:")
## [1] "Recall per class:"
print(recall)
## [1] 0.5464061 0.0000000 0.6232227
# Calculate average Precision and Recall
average_precision <- mean(precision, na.rm = TRUE)
average_recall <- mean(recall, na.rm = TRUE)

print(paste("Average Precision:", average_precision))
## [1] "Average Precision: 0.400170937514569"
print(paste("Average Recall:", average_recall))
## [1] "Average Recall: 0.389876296592727"

The average precision is around 40%, similar to the accuracy reported with the confusion matrix. Note that it is averaged over ‘low’ and ‘high’ only, because precision is undefined (NaN) for ‘medium’, which is never predicted. The average recall shows that around 39% of the true instances are correctly identified. In conclusion, the classification model does not perform well, as was also the case for the GLM classification model for price prediction.

5.8 Generalised Additive Model

In this chapter, Generalized Additive Models (GAMs) are applied with price as the response variable, to analyze its relationship with the predictor variables.

The first goal is to identify key factors influencing prices in Barcelona, addressing Research Question 1 (What are the key factors influencing accommodation prices in Barcelona?)

First, we aim to determine whether a nonlinear relationship exists between the independent variables and price. To explore this, the variable Review Score Rating will be plotted against Price to visualize whether the relationship is linear or not.

From the plot above and Chapter 4.3 of this report, we observe that many points are concentrated on the right side, where higher review scores are paired with lower prices. This suggests a lack of a strong relationship between Review Score Ratings and Price.

Given that at least one variable does not exhibit a linear relationship with price, we proceed with a Generalized Additive Model (GAM) to better capture potential nonlinear relationships.

5.8.1 GAM Model Training for Price Prediction

The GAM is fitted on the training data.

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## price ~ s(bathrooms) + s(bedrooms) + s(accommodates) + s(beds) + 
##     s(latitude) + s(longitude) + s(review_scores_rating) + s(minimum_nights) + 
##     room_type + neighbourhood
## 
## Parametric coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                   174.428     53.585   3.255
## room_typePrivate room                         -54.549      3.773 -14.456
## room_typeShared room                          -79.789     12.466  -6.400
## neighbourhoodCamp d'en Grassot i Gràcia Nova  -67.663     55.026  -1.230
## neighbourhoodCan Baro                         -44.474     62.204  -0.715
## neighbourhoodCarmel                           -58.667     55.395  -1.059
## neighbourhoodCiutat Vella                     -54.459     54.607  -0.997
## neighbourhoodDiagonal Mar - La Mar Bella      -13.863     57.198  -0.242
## neighbourhoodDreta de l'Eixample              -26.482     54.019  -0.490
## neighbourhoodEixample                         -47.034     53.738  -0.875
## neighbourhoodEl Baix Guinardó                 -47.524     55.788  -0.852
## neighbourhoodEl Besòs i el Maresme            -67.817     56.913  -1.192
## neighbourhoodEl Bon Pastor                    -78.193     67.115  -1.165
## neighbourhoodEl Born                          -62.377     56.385  -1.106
## neighbourhoodEl Camp de l'Arpa del Clot       -64.039     55.512  -1.154
## neighbourhoodEl Clot                          -46.368     57.012  -0.813
## neighbourhoodEl Coll                          -99.812     80.982  -1.233
## neighbourhoodEl Congrés i els Indians         -73.326     57.555  -1.274
## neighbourhoodel Fort Pienc                    -53.375     55.087  -0.969
## neighbourhoodEl Gòtic                         -51.478     55.038  -0.935
## neighbourhoodEl Poble-sec                     -63.246     54.885  -1.152
## neighbourhoodEl Poblenou                      -55.890     55.754  -1.002
## neighbourhoodEl Putget i Farró                 33.413     54.681   0.611
## neighbourhoodEl Raval                         -56.553     54.594  -1.036
## neighbourhoodGlòries - El Parc                -74.015     55.891  -1.324
## neighbourhoodGràcia                           -37.853     53.329  -0.710
## neighbourhoodGuinardó                          -2.162     55.442  -0.039
## neighbourhoodHorta                            -36.868     72.911  -0.506
## neighbourhoodHorta-Guinardó                   -43.272     52.805  -0.819
## neighbourhoodL'Antiga Esquerra de l'Eixample  -47.718     53.980  -0.884
## neighbourhoodLa Barceloneta                   -49.809     56.159  -0.887
## neighbourhoodLa Font d'en Fargues             -65.480     80.839  -0.810
## neighbourhoodLa Maternitat i Sant Ramon       -80.454     55.530  -1.449
## neighbourhoodLa Nova Esquerra de l'Eixample   -53.857     54.227  -0.993
## neighbourhoodLa Prosperitat                   -11.584    103.804  -0.112
## neighbourhoodLa Sagrada Família               -61.659     54.290  -1.136
## neighbourhoodLa Sagrera                       -39.724     59.062  -0.673
## neighbourhoodLa Salut                         -59.806     56.479  -1.059
## neighbourhoodLa Teixonera                     -94.493     68.290  -1.384
## neighbourhoodLa Trinitat Vella                -87.274     84.294  -1.035
## neighbourhoodLa Verneda i La Pau              -73.838     63.074  -1.171
## neighbourhoodLa Vila Olímpica                 -36.563     58.240  -0.628
## neighbourhoodLes Corts                        -60.479     53.881  -1.122
## neighbourhoodLes Tres Torres                  -49.452     63.813  -0.775
## neighbourhoodMontbau                          -39.033     72.183  -0.541
## neighbourhoodNavas                            -65.133     57.272  -1.137
## neighbourhoodNou Barris                       -51.330     54.858  -0.936
## neighbourhoodPedralbes                        -13.899     65.834  -0.211
## neighbourhoodPorta                            -46.894     81.671  -0.574
## neighbourhoodProvençals del Poblenou          -53.560     59.460  -0.901
## neighbourhoodSant Andreu                      -61.706     54.465  -1.133
## neighbourhoodSant Andreu de Palomar           -50.605     58.877  -0.860
## neighbourhoodSant Antoni                      -51.243     54.544  -0.939
## neighbourhoodSant Genís dels Agudells         -22.436     67.365  -0.333
## neighbourhoodSant Gervasi - Galvany           -51.299     54.462  -0.942
## neighbourhoodSant Gervasi - la Bonanova       -39.891     59.974  -0.665
## neighbourhoodSant Martí                       -56.760     54.552  -1.040
## neighbourhoodSant Martí de Provençals         -63.418     57.789  -1.097
## neighbourhoodSant Pere/Santa Caterina         -54.769     55.116  -0.994
## neighbourhoodSants-Montjuïc                   -58.904     53.901  -1.093
## neighbourhoodSarrià                           -43.530     57.619  -0.755
## neighbourhoodSarrià-Sant Gervasi              -31.073     53.380  -0.582
## neighbourhoodTrinitat Nova                    -34.386     83.495  -0.412
## neighbourhoodTuró de la Peira - Can Peguera   -38.170     72.901  -0.524
## neighbourhoodVallcarca i els Penitents        -14.720     55.757  -0.264
## neighbourhoodVerdum - Los Roquetes            -57.912     75.111  -0.771
## neighbourhoodVila de Gràcia                   -36.911     53.616  -0.688
## neighbourhoodVilapicina i la Torre Llobeta    -50.991     59.803  -0.853
##                                              Pr(>|t|)    
## (Intercept)                                   0.00114 ** 
## room_typePrivate room                         < 2e-16 ***
## room_typeShared room                         1.67e-10 ***
## neighbourhoodCamp d'en Grassot i Gràcia Nova  0.21887    
## neighbourhoodCan Baro                         0.47465    
## neighbourhoodCarmel                           0.28962    
## neighbourhoodCiutat Vella                     0.31867    
## neighbourhoodDiagonal Mar - La Mar Bella      0.80850    
## neighbourhoodDreta de l'Eixample              0.62398    
## neighbourhoodEixample                         0.38148    
## neighbourhoodEl Baix Guinardó                 0.39432    
## neighbourhoodEl Besòs i el Maresme            0.23347    
## neighbourhoodEl Bon Pastor                    0.24404    
## neighbourhoodEl Born                          0.26865    
## neighbourhoodEl Camp de l'Arpa del Clot       0.24871    
## neighbourhoodEl Clot                          0.41607    
## neighbourhoodEl Coll                          0.21781    
## neighbourhoodEl Congrés i els Indians         0.20271    
## neighbourhoodel Fort Pienc                    0.33262    
## neighbourhoodEl Gòtic                         0.34966    
## neighbourhoodEl Poble-sec                     0.24923    
## neighbourhoodEl Poblenou                      0.31617    
## neighbourhoodEl Putget i Farró                0.54119    
## neighbourhoodEl Raval                         0.30030    
## neighbourhoodGlòries - El Parc                0.18547    
## neighbourhoodGràcia                           0.47785    
## neighbourhoodGuinardó                         0.96890    
## neighbourhoodHorta                            0.61311    
## neighbourhoodHorta-Guinardó                   0.41255    
## neighbourhoodL'Antiga Esquerra de l'Eixample  0.37674    
## neighbourhoodLa Barceloneta                   0.37515    
## neighbourhoodLa Font d'en Fargues             0.41797    
## neighbourhoodLa Maternitat i Sant Ramon       0.14744    
## neighbourhoodLa Nova Esquerra de l'Eixample   0.32066    
## neighbourhoodLa Prosperitat                   0.91115    
## neighbourhoodLa Sagrada Família               0.25612    
## neighbourhoodLa Sagrera                       0.50124    
## neighbourhoodLa Salut                         0.28969    
## neighbourhoodLa Teixonera                     0.16650    
## neighbourhoodLa Trinitat Vella                0.30055    
## neighbourhoodLa Verneda i La Pau              0.24179    
## neighbourhoodLa Vila Olímpica                 0.53017    
## neighbourhoodLes Corts                        0.26171    
## neighbourhoodLes Tres Torres                  0.43840    
## neighbourhoodMontbau                          0.58870    
## neighbourhoodNavas                            0.25547    
## neighbourhoodNou Barris                       0.34947    
## neighbourhoodPedralbes                        0.83280    
## neighbourhoodPorta                            0.56586    
## neighbourhoodProvençals del Poblenou          0.36774    
## neighbourhoodSant Andreu                      0.25728    
## neighbourhoodSant Andreu de Palomar           0.39010    
## neighbourhoodSant Antoni                      0.34753    
## neighbourhoodSant Genís dels Agudells         0.73910    
## neighbourhoodSant Gervasi - Galvany           0.34627    
## neighbourhoodSant Gervasi - la Bonanova       0.50599    
## neighbourhoodSant Martí                       0.29816    
## neighbourhoodSant Martí de Provençals         0.27251    
## neighbourhoodSant Pere/Santa Caterina         0.32041    
## neighbourhoodSants-Montjuïc                   0.27452    
## neighbourhoodSarrià                           0.44998    
## neighbourhoodSarrià-Sant Gervasi              0.56052    
## neighbourhoodTrinitat Nova                    0.68047    
## neighbourhoodTuró de la Peira - Can Peguera   0.60058    
## neighbourhoodVallcarca i els Penitents        0.79179    
## neighbourhoodVerdum - Los Roquetes            0.44072    
## neighbourhoodVila de Gràcia                   0.49121    
## neighbourhoodVilapicina i la Torre Llobeta    0.39389    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                           edf Ref.df      F  p-value    
## s(bathrooms)            4.288  5.251 11.844  < 2e-16 ***
## s(bedrooms)             5.078  5.968  4.426 0.000178 ***
## s(accommodates)         2.155  2.811 23.342  < 2e-16 ***
## s(beds)                 5.157  6.011  3.090 0.005113 ** 
## s(latitude)             7.387  8.429  6.114  < 2e-16 ***
## s(longitude)            1.011  1.022  4.651 0.030939 *  
## s(review_scores_rating) 4.113  4.990 17.244  < 2e-16 ***
## s(minimum_nights)       5.147  5.957 46.977  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.335   Deviance explained = 34.6%
## -REML =  34744  Scale est. = 7667.3    n = 5943

The model explains 34.6% of the deviance (adjusted R² = 0.335), indicating moderate explanatory power with room for improvement.

Among the smooth terms, minimum nights, accommodates, review scores and bathrooms have the largest F statistics and are all highly significant (p < 0.001), which makes them the strongest predictors of price. Bedrooms and beds are also significant. The spatial variables are significant as well, latitude strongly (p < 0.001) and longitude at the 5% level. Among the categorical predictors, room type has a significant impact: private rooms and shared rooms are priced about 55 and 80 units lower, respectively, than entire homes/apartments. None of the individual neighborhood coefficients is statistically significant.
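The call that produced the summary above is not shown in the report. Judging from the printed formula, the gaussian family and the -REML line, it was presumably an mgcv::gam() fit with method = "REML". The sketch below shows the shape of such a call on simulated data; the `toy` data frame, its values and the basis size k = 5 are assumptions made for illustration and do not reproduce the report's results.

```r
library(mgcv)

set.seed(1)
n <- 400
# Simulated listings; column names mirror a few of the report's predictors
toy <- data.frame(
  accommodates = sample(1:8, n, replace = TRUE),
  review_scores_rating = runif(n, 3, 5),
  room_type = factor(sample(c("Entire home/apt", "Private room"), n, replace = TRUE))
)
# Simulated price with one nonlinear effect and a room-type shift
toy$price <- 40 + 15 * toy$accommodates +
  20 * sin(2 * toy$review_scores_rating) -
  30 * (toy$room_type == "Private room") +
  rnorm(n, sd = 20)

# Smooth terms for numeric predictors, parametric term for the factor.
# k = 5 keeps the basis smaller than the 8 distinct values of accommodates.
fit <- gam(
  price ~ s(accommodates, k = 5) + s(review_scores_rating) + room_type,
  data = toy, family = gaussian(), method = "REML"
)

summary(fit)
```

In the summary of such a fit, factors appear under the parametric coefficients and each s() term gets an effective degrees of freedom (edf) and an approximate F test, which is the layout seen in the output above.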

5.8.2 GAM Model Prediction of Pricing Evaluation

## R-squared on Test Set:  0.3279427

The plot above compares the observed prices against the predicted values generated by the model. The points deviate more from the red line at higher price ranges, indicating that the model performs better for lower price ranges. This behavior may be due to the presence of outliers, which limit the model’s ability to predict higher prices accurately.

5.8.3 GAM Model: Evaluating Predictive Pricing Performance

## MAE:  45.69611
## RMSE:  91.81648
## R-squared:  0.3279427

The results above provide the key evaluation metrics for the GAM used to predict Airbnb prices.

  • Mean Absolute Error (MAE): the MAE is 45.70, meaning that, on average, the predicted prices differ from the observed prices by 45.70 price units.

  • Root Mean Square Error (RMSE): the RMSE is 91.82, about twice the MAE, suggesting that there are some outliers or large deviations between observed and predicted prices, especially for higher-priced Airbnb listings.

  • R-squared: the R-squared on the test set is 0.33, close to the adjusted value of 0.335 obtained on the training data.

These metrics suggest that while the GAM provides a better fit for lower-priced listings, its predictive performance is less accurate for higher prices, likely due to the influence of outliers.
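The code behind these three metrics is not shown in the report. A minimal sketch of how they are commonly computed follows; the helper `regression_metrics()` and the small example vectors are hypothetical, and R-squared is assumed here to be 1 - SSE/SST, which may differ from the definition the authors used.

```r
# Hypothetical helper: MAE, RMSE and R-squared from observed and predicted values
regression_metrics <- function(observed, predicted) {
  err <- observed - predicted
  c(
    MAE  = mean(abs(err)),
    RMSE = sqrt(mean(err^2)),
    R2   = 1 - sum(err^2) / sum((observed - mean(observed))^2)
  )
}

# Toy example: four small errors and one expensive listing that is badly underpredicted
observed  <- c(80, 120, 60, 300, 95)
predicted <- c(90, 110, 70, 200, 100)

m <- regression_metrics(observed, predicted)
m
```

In the toy example a single large error pulls the RMSE well above the MAE, which is the same pattern as in the GAM results, where the RMSE is roughly twice the MAE.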

5.8.4 GAM Model Training for Occupancy Prediction

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## occupancy_rate_30 ~ s(latitude) + s(longitude) + s(bathrooms) + 
##     s(bedrooms) + s(accommodates) + s(beds) + s(price) + s(minimum_nights) + 
##     s(review_scores_rating) + (neighbourhood)
## 
## Parametric coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                   66.84643   20.78582   3.216
## neighbourhoodCamp d'en Grassot i Gràcia Nova  15.29053   21.21276   0.721
## neighbourhoodCan Baro                         -4.89359   23.39917  -0.209
## neighbourhoodCarmel                           18.05269   21.67874   0.833
## neighbourhoodCiutat Vella                      5.17335   21.00320   0.246
## neighbourhoodDiagonal Mar - La Mar Bella       2.52654   21.81608   0.116
## neighbourhoodDreta de l'Eixample               7.36383   20.90720   0.352
## neighbourhoodEixample                          8.34876   20.84609   0.400
## neighbourhoodEl Baix Guinardó                  1.40899   21.42145   0.066
## neighbourhoodEl Besòs i el Maresme             7.56102   21.84241   0.346
## neighbourhoodEl Bon Pastor                     2.72680   24.72421   0.110
## neighbourhoodEl Born                           3.47964   21.47404   0.162
## neighbourhoodEl Camp de l'Arpa del Clot       -7.33206   21.27541  -0.345
## neighbourhoodEl Clot                           4.96020   21.67989   0.229
## neighbourhoodEl Coll                          29.51129   29.14518   1.013
## neighbourhoodEl Congrés i els Indians         12.58958   22.13662   0.569
## neighbourhoodel Fort Pienc                     9.86991   21.17010   0.466
## neighbourhoodEl Gòtic                          4.51670   21.12756   0.214
## neighbourhoodEl Poble-sec                      2.46171   21.16763   0.116
## neighbourhoodEl Poblenou                       7.61134   21.36080   0.356
## neighbourhoodEl Putget i Farró                -1.04808   21.22717  -0.049
## neighbourhoodEl Raval                          9.21152   21.04325   0.438
## neighbourhoodGlòries - El Parc                -7.58301   21.37555  -0.355
## neighbourhoodGràcia                            9.80653   20.81584   0.471
## neighbourhoodGuinardó                          7.44524   21.51030   0.346
## neighbourhoodHorta                            -8.48519   26.64170  -0.318
## neighbourhoodHorta-Guinardó                   12.62002   20.85610   0.605
## neighbourhoodL'Antiga Esquerra de l'Eixample   3.58489   20.97582   0.171
## neighbourhoodLa Barceloneta                    3.02710   21.40393   0.141
## neighbourhoodLa Font d'en Fargues             14.32327   29.08695   0.492
## neighbourhoodLa Maternitat i Sant Ramon        4.17613   20.55349   0.203
## neighbourhoodLa Nova Esquerra de l'Eixample   10.01991   21.00567   0.477
## neighbourhoodLa Prosperitat                  -34.74147   35.67909  -0.974
## neighbourhoodLa Sagrada Família                9.39787   20.95887   0.448
## neighbourhoodLa Sagrera                       -3.50231   22.54129  -0.155
## neighbourhoodLa Salut                          9.59270   21.70711   0.442
## neighbourhoodLa Teixonera                     22.47110   25.32633   0.887
## neighbourhoodLa Trinitat Vella                28.12841   29.53684   0.952
## neighbourhoodLa Verneda i La Pau               0.01125   23.67120   0.000
## neighbourhoodLa Vila Olímpica                 -0.47263   22.03475  -0.021
## neighbourhoodLes Corts                         5.11209   20.65352   0.248
## neighbourhoodLes Tres Torres                  14.64719   23.80717   0.615
## neighbourhoodMontbau                          10.58624   26.57013   0.398
## neighbourhoodNavas                           -12.92257   21.88357  -0.591
## neighbourhoodNou Barris                       -7.61793   21.23428  -0.359
## neighbourhoodPedralbes                       -14.86333   22.84037  -0.651
## neighbourhoodPorta                           -58.95588   29.15289  -2.022
## neighbourhoodProvençals del Poblenou          11.87172   22.41982   0.530
## neighbourhoodSant Andreu                      -0.08328   21.21798  -0.004
## neighbourhoodSant Andreu de Palomar           13.64181   22.43753   0.608
## neighbourhoodSant Antoni                       6.88885   21.06979   0.327
## neighbourhoodSant Genís dels Agudells         13.18838   25.17867   0.524
## neighbourhoodSant Gervasi - Galvany           -1.27434   21.16764  -0.060
## neighbourhoodSant Gervasi - la Bonanova      -21.49174   22.60517  -0.951
## neighbourhoodSant Martí                        2.45604   21.00509   0.117
## neighbourhoodSant Martí de Provençals          1.64003   22.03713   0.074
## neighbourhoodSant Pere/Santa Caterina          5.83101   21.12686   0.276
## neighbourhoodSants-Montjuïc                    5.55296   20.92918   0.265
## neighbourhoodSarrià                            7.92799   20.33856   0.390
## neighbourhoodSarrià-Sant Gervasi               0.94816   20.75466   0.046
## neighbourhoodTrinitat Nova                    -1.81837   29.37886  -0.062
## neighbourhoodTuró de la Peira - Can Peguera    6.57138   26.63044   0.247
## neighbourhoodVallcarca i els Penitents         8.77051   21.55017   0.407
## neighbourhoodVerdum - Los Roquetes            24.39287   26.86071   0.908
## neighbourhoodVila de Gràcia                    8.68923   20.88628   0.416
## neighbourhoodVilapicina i la Torre Llobeta     9.94946   22.84562   0.436
##                                              Pr(>|t|)   
## (Intercept)                                   0.00131 **
## neighbourhoodCamp d'en Grassot i Gràcia Nova  0.47105   
## neighbourhoodCan Baro                         0.83435   
## neighbourhoodCarmel                           0.40503   
## neighbourhoodCiutat Vella                     0.80545   
## neighbourhoodDiagonal Mar - La Mar Bella      0.90781   
## neighbourhoodDreta de l'Eixample              0.72469   
## neighbourhoodEixample                         0.68881   
## neighbourhoodEl Baix Guinardó                 0.94756   
## neighbourhoodEl Besòs i el Maresme            0.72923   
## neighbourhoodEl Bon Pastor                    0.91218   
## neighbourhoodEl Born                          0.87128   
## neighbourhoodEl Camp de l'Arpa del Clot       0.73039   
## neighbourhoodEl Clot                          0.81904   
## neighbourhoodEl Coll                          0.31131   
## neighbourhoodEl Congrés i els Indians         0.56957   
## neighbourhoodel Fort Pienc                    0.64108   
## neighbourhoodEl Gòtic                         0.83072   
## neighbourhoodEl Poble-sec                     0.90742   
## neighbourhoodEl Poblenou                      0.72161   
## neighbourhoodEl Putget i Farró                0.96062   
## neighbourhoodEl Raval                         0.66159   
## neighbourhoodGlòries - El Parc                0.72279   
## neighbourhoodGràcia                           0.63758   
## neighbourhoodGuinardó                         0.72926   
## neighbourhoodHorta                            0.75012   
## neighbourhoodHorta-Guinardó                   0.54514   
## neighbourhoodL'Antiga Esquerra de l'Eixample  0.86430   
## neighbourhoodLa Barceloneta                   0.88754   
## neighbourhoodLa Font d'en Fargues             0.62243   
## neighbourhoodLa Maternitat i Sant Ramon       0.83900   
## neighbourhoodLa Nova Esquerra de l'Eixample   0.63337   
## neighbourhoodLa Prosperitat                   0.33024   
## neighbourhoodLa Sagrada Família               0.65388   
## neighbourhoodLa Sagrera                       0.87653   
## neighbourhoodLa Salut                         0.65857   
## neighbourhoodLa Teixonera                     0.37497   
## neighbourhoodLa Trinitat Vella                0.34098   
## neighbourhoodLa Verneda i La Pau              0.99962   
## neighbourhoodLa Vila Olímpica                 0.98289   
## neighbourhoodLes Corts                        0.80452   
## neighbourhoodLes Tres Torres                  0.53842   
## neighbourhoodMontbau                          0.69033   
## neighbourhoodNavas                            0.55487   
## neighbourhoodNou Barris                       0.71979   
## neighbourhoodPedralbes                        0.51523   
## neighbourhoodPorta                            0.04319 * 
## neighbourhoodProvençals del Poblenou          0.59647   
## neighbourhoodSant Andreu                      0.99687   
## neighbourhoodSant Andreu de Palomar           0.54322   
## neighbourhoodSant Antoni                      0.74371   
## neighbourhoodSant Genís dels Agudells         0.60044   
## neighbourhoodSant Gervasi - Galvany           0.95200   
## neighbourhoodSant Gervasi - la Bonanova       0.34177   
## neighbourhoodSant Martí                       0.90692   
## neighbourhoodSant Martí de Provençals         0.94068   
## neighbourhoodSant Pere/Santa Caterina         0.78256   
## neighbourhoodSants-Montjuïc                   0.79077   
## neighbourhoodSarrià                           0.69670   
## neighbourhoodSarrià-Sant Gervasi              0.96356   
## neighbourhoodTrinitat Nova                    0.95065   
## neighbourhoodTuró de la Peira - Can Peguera   0.80510   
## neighbourhoodVallcarca i els Penitents        0.68404   
## neighbourhoodVerdum - Los Roquetes            0.36385   
## neighbourhoodVila de Gràcia                   0.67741   
## neighbourhoodVilapicina i la Torre Llobeta    0.66321   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                           edf Ref.df      F  p-value    
## s(latitude)             2.020  2.695  0.807 0.503167    
## s(longitude)            5.694  7.000  2.304 0.024207 *  
## s(bathrooms)            1.011  1.021 11.414 0.000688 ***
## s(bedrooms)             4.845  5.790  3.527 0.001735 ** 
## s(accommodates)         4.274  5.216  8.978  < 2e-16 ***
## s(beds)                 2.393  3.030  1.410 0.238039    
## s(price)                5.429  6.518 29.183  < 2e-16 ***
## s(minimum_nights)       4.105  4.790  8.485  1.3e-07 ***
## s(review_scores_rating) 2.431  2.918 18.879  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0723   Deviance explained = 8.75%
## -REML =  28234  Scale est. = 834.69    n = 5943

The GAM, which is used to predict the 30-day occupancy rate, explains only about 7% of the variability in the data (adjusted R² = 0.072), with a deviance explained of 8.75%. The results indicate that price, minimum nights, review scores rating, bathrooms, bedrooms and accommodates are significant predictors of occupancy. Among the spatial variables, longitude is significant only at the 5% level, while latitude and almost all neighbourhood levels shown above are not significant; the number of beds is not significant either. The low R² suggests that other factors not included in the dataset, such as seasonality, might play a significant role in determining occupancy rates.

5.8.5 GAM Model Prediction of Occupancy Rate

## R-squared on Test Set (Occupancy 30):  0.05990545

5.8.6 GAM Model: Evaluating Predictive Occupancy Rate Performance

## MAE:  64.63602
## RMSE:  66.07687

The MAE indicates that, on average, the model’s predictions deviate from the observed occupancy rates by approximately 64.64 percentage points.

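The RMSE of about 66.08 is only slightly higher than the MAE, which suggests that the errors are consistently large rather than driven by a few outliers. Together with the low R², this motivates trying an alternative model.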
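
How the gap between MAE and RMSE reflects the spread of error sizes can be illustrated with a small sketch. The helper functions `mae()` and `rmse()` and the toy numbers below are hypothetical and are not part of the original analysis; they only show that the two metrics coincide when every error has the same size and diverge when the error is concentrated in a few observations.

```r
# Hypothetical helper functions (not from the original analysis)
mae  <- function(actual, predicted) mean(abs(actual - predicted))
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Toy data: every prediction is off by the same amount
actual     <- c(10, 20, 30, 40, 50)
pred_const <- actual + 5

# Toy data: the same total absolute error, concentrated in one observation
pred_spike <- actual + c(0, 0, 0, 0, 25)

cat("Constant errors: MAE =", mae(actual, pred_const),
    " RMSE =", rmse(actual, pred_const), "\n")
cat("One large error: MAE =", mae(actual, pred_spike),
    " RMSE =", rmse(actual, pred_spike), "\n")
```

In the first case both metrics are equal; in the second the MAE is unchanged while the RMSE is more than twice as large.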

5.9 Neural Network

5.9.1 Neural Network Model Training for Price Prediction

A neural network with a single hidden layer of three nodes is trained on the normalized training data. The model uses a linear activation function for the output.

##               Length Class      Mode    
## call              5  -none-     call    
## response       5943  -none-     numeric 
## covariate     47544  -none-     numeric 
## model.list        2  -none-     list    
## err.fct           1  -none-     function
## act.fct           1  -none-     function
## linear.output     1  -none-     logical 
## data             26  data.frame list    
## exclude           0  -none-     NULL

The R-squared value is 0.285, meaning that the model explains only 28.5% of the variability in prices.
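
The normalization step itself is not shown in this section. As a minimal sketch, assuming min-max scaling was used, the scaling parameters can be learned on the training data only and then reused for the test data, so that no information from the test set leaks into training. The helpers `fit_minmax()` and `apply_minmax()` and the toy prices are hypothetical.

```r
# Hypothetical helpers (assumption: min-max scaling; the report does not
# state which normalization was actually applied)
fit_minmax <- function(x) {
  list(min = min(x, na.rm = TRUE), max = max(x, na.rm = TRUE))
}
apply_minmax <- function(x, params) {
  (x - params$min) / (params$max - params$min)
}

# Toy prices, made up for illustration
train_price <- c(40, 60, 100, 140)
test_price  <- c(50, 160)

# Learn the range on the training data, then apply it to both sets
params       <- fit_minmax(train_price)
train_scaled <- apply_minmax(train_price, params)
test_scaled  <- apply_minmax(test_price, params)

print(train_scaled)
print(test_scaled)
```

Test values outside the training range map outside [0, 1], which is expected when the test set contains prices not seen during training.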

5.10 Support Vector Machine

5.10.1 SVM for Price prediction

The SVM model for price prediction is a regression model (support vector regression, SVR), since the target variable is numeric. We use the normalized numerical variables because SVMs are sensitive to the scale of the features. For the SVR model we choose a radial kernel to better capture non-linear relationships between the predictors and the response. Tuning the cost parameter is crucial to balance over-fitting and under-fitting: in regression, the cost parameter (C) controls the trade-off between keeping the fitted function flat and penalizing training errors that fall outside the epsilon-insensitive band.
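
The role of the cost parameter in regression can be sketched with the epsilon-insensitive loss used by eps-regression: errors smaller than epsilon are ignored, and larger errors are penalized linearly, weighted by the cost. The function `eps_insensitive_loss()` and the toy values are hypothetical and only illustrate the idea; they do not reproduce the internals of `svm()`.

```r
# Hypothetical illustration of the epsilon-insensitive loss
eps_insensitive_loss <- function(actual, predicted, epsilon = 0.1) {
  # Errors inside the epsilon band cost nothing; the excess is penalized
  pmax(0, abs(actual - predicted) - epsilon)
}

# Toy values, made up for illustration
actual    <- c(1.0, 1.0, 1.0)
predicted <- c(1.05, 1.30, 0.50)

loss <- eps_insensitive_loss(actual, predicted, epsilon = 0.1)
print(loss)

# A larger cost makes the same violations more expensive, pushing the
# optimizer to fit the training data more closely
penalty_low  <- 0.1 * sum(loss)
penalty_high <- 10  * sum(loss)
cat("Penalty with cost 0.1:", penalty_low, "\n")
cat("Penalty with cost 10: ", penalty_high, "\n")
```

The first prediction lies inside the band and contributes no loss; only the other two are penalized.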

# Check whether price is right-skewed; if so, apply a log transformation before fitting the model
hist(train_normalized_encod$price, 
     main = "Histogram of Price", 
     xlab = "Price", 
     col = "lightblue", 
     border = "black", 
     breaks = 30) 

# Apply log transformation to price, adding 1 to handle potential zeros
train_normalized_encod$log_price <- log(train_normalized_encod$price + 1)
# test_normalized_encod$log_price <- log(test_normalized_encod$price + 1)

# Visualize the new distribution after the log transform
hist(train_normalized_encod$log_price, 
     main = "Histogram of log(Price + 1)", 
     xlab = "log(Price + 1)", 
     col = "lightblue", 
     border = "black", 
     breaks = 30)

# Formula for SVM model
formula_svm <- log_price ~ bathrooms + bedrooms + accommodates + beds + latitude + longitude + review_scores_rating + minimum_nights + neighbourhood

# Tuning (tune.svm() and svm() come from the e1071 package)
library(e1071)
set.seed(123)
cv_model <- tune.svm(formula_svm, data=train_normalized_encod, kernel = 'radial', cost=c(0.01, 0.1, 1, 10), gamma=c(0.01, 0.1, 1))
cv_model
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma cost
##    0.1   10
## 
## - best performance: 0.004963518

We apply the values of cost and gamma according to the result obtained by tuning the hyper-parameters.

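# Train the SVM model with the tuned hyper-parameters
# Note: scale = FALSE is set here, whereas tune.svm() above ran with svm()'s default scale = TRUE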
SVM_model <- svm(formula_svm, data=train_normalized_encod, kernel = 'radial', scale = FALSE, cost = 10, gamma = 0.1)
summary(SVM_model)
## 
## Call:
## svm(formula = formula_svm, data = train_normalized_encod, kernel = "radial", 
##     cost = 10, gamma = 0.1, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  10 
##       gamma:  0.1 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  528

Since the SVM is used for regression, we evaluate it with the following metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and R-squared.

# Prediction on test data
predictions <- predict(SVM_model, newdata = test_normalized_encod)
# Convert log-transformed predictions back to original price scale
predictions_price <- exp(predictions) - 1

# Calculate metrics
# Mean Absolute Error (MAE)
mae_svm <- mean(abs(predictions_price - test_normalized_encod$price))

# Root Mean Squared Error (RMSE)
rmse_svm <- sqrt(mean((predictions_price - test_normalized_encod$price)^2))

# R-squared
ss_total <- sum((test_normalized_encod$price - mean(test_normalized_encod$price))^2)
ss_residual <- sum((test_normalized_encod$price - predictions_price)^2)
r_squared_svm <- 1 - (ss_residual / ss_total)

# Output metrics
cat("MAE:", mae_svm, "\n")
## MAE: 0.07821782
cat("RMSE:", rmse_svm, "\n")
## RMSE: 0.1063871
cat("R-squared:", r_squared_svm, "\n")
## R-squared: 0.1103046

A Mean Absolute Error (MAE) of about 0.08 means that, on average, the model’s predictions are off by about 0.08 units from the actual price. Because the price is normalized, this error is expressed on the normalized scale, so its small size is not by itself evidence of good performance. The RMSE (0.106) is noticeably larger than the MAE, which points to some large individual errors that influence the result. Finally, the R-squared value indicates that the model explains only around 11% of the variance of the price, which is a poor result.
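
To interpret the MAE in currency units, the error can be mapped back to the original price scale. The sketch below assumes that price was min-max normalized; the values of `price_min` and `price_max` are hypothetical placeholders, not figures from the dataset, and should be replaced by the minimum and maximum actually used during normalization.

```r
# MAE reported above, on the normalized price scale
mae_normalized <- 0.07821782

# Hypothetical range of the original prices (placeholders, not real data)
price_min <- 20
price_max <- 500

# Under min-max scaling, an absolute error scales with the price range
rescale_error <- function(error, x_min, x_max) error * (x_max - x_min)

mae_original <- rescale_error(mae_normalized, price_min, price_max)
cat("MAE in original price units:", round(mae_original, 2), "\n")
```

With these placeholder values, a normalized MAE of about 0.08 corresponds to an average error of roughly 37.5 price units, which shows why the normalized figure alone can look deceptively small.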

5.10.2 SVM for Occupancy rate prediction

We fit an SVM regression model for the occupancy rate. As in the SVR price model, we use the normalized numerical variables and encode categorical variables as numerical values. Here too we choose a radial kernel, since the predictors have mainly non-linear relationships with the response variable. We then tune the cost and gamma hyper-parameters.

# Check whether the occupancy rate is right-skewed; if so, apply a log transformation before fitting the model
hist(train_normalized_encod$occupancy_rate_30, 
     main = "Histogram of Occupancy rate", 
     xlab = "Occupancy rate (30 days)", 
     col = "lightblue", 
     border = "black", 
     breaks = 30) 

# Formula for SVR model
formula_svr_occupancy <- occupancy_rate_30 ~ bathrooms + bedrooms + accommodates + beds + latitude + longitude + review_scores_rating + minimum_nights + neighbourhood + price

# Tuning
set.seed(123)
cv_model <- tune.svm(formula_svr_occupancy, data=train_normalized_encod, kernel = 'radial', cost=c(0.1, 1, 10, 100), gamma=c(0.01, 0.1, 1, 10))
cv_model$best.model
## 
## Call:
## best.svm(x = formula_svr_occupancy, data = train_normalized_encod, 
##     gamma = c(0.01, 0.1, 1, 10), cost = c(0.1, 1, 10, 100), kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  10 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  5433

We apply the values of cost and gamma according to the result obtained by tuning the hyper-parameters.

# Train the SVM model
SVM_occupancy <- svm(formula_svr_occupancy, data=train_normalized_encod, kernel = 'radial', scale = FALSE, cost = 1, gamma = 10, type='eps-regression')

Since the SVM is used for regression, we evaluate it with the following metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and R-squared.

# Prediction on test data
predictions <- predict(SVM_occupancy, newdata = test_normalized_encod)

# Calculate metrics
# Mean Absolute Error (MAE)
mae_svm <- mean(abs(predictions - test_normalized_encod$occupancy_rate_30))

# Root Mean Squared Error (RMSE)
rmse_svm <- sqrt(mean((predictions - test_normalized_encod$occupancy_rate_30)^2))

# R-squared
ss_total <- sum((test_normalized_encod$occupancy_rate_30 - mean(test_normalized_encod$occupancy_rate_30))^2)
ss_residual <- sum((test_normalized_encod$occupancy_rate_30 - predictions)^2)
r_squared_svm <- 1 - (ss_residual / ss_total)

# Output metrics
cat("MAE:", mae_svm, "\n")
## MAE: 22.56461
cat("RMSE:", rmse_svm, "\n")
## RMSE: 31.30801
cat("R-squared:", r_squared_svm, "\n")
## R-squared: -0.09070758

A Mean Absolute Error (MAE) of about 22.6 means that, on average, the model’s predictions are off by about 22.6 units from the actual occupancy rate. Here too, the RMSE (31.3) is noticeably larger than the MAE, which points to some large individual errors that influence the result. Finally, the R-squared value is negative (-0.09). This means the model performs worse than a simple baseline that predicts the mean of the target variable for all observations: it struggles to generalize or capture the variance in the data.
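
The comparison with a mean baseline can be made concrete with a small sketch. The function `r_squared()` mirrors the formula used in the evaluation code above, while the toy occupancy values are made up for illustration: a model that always predicts the mean scores exactly 0, and predictions clustered far from most observations score below 0.

```r
# Same R-squared definition as in the evaluation code above
r_squared <- function(actual, predicted) {
  ss_residual <- sum((actual - predicted)^2)
  ss_total    <- sum((actual - mean(actual))^2)
  1 - ss_residual / ss_total
}

# Toy occupancy rates, made up for illustration
actual <- c(0, 20, 50, 80, 100)

# Baseline: always predict the mean of the observed values
pred_mean <- rep(mean(actual), length(actual))

# Predictions concentrated at the high end of the range
pred_high <- c(80, 82, 85, 87, 90)

cat("R-squared, mean baseline:    ", r_squared(actual, pred_mean), "\n")
cat("R-squared, clustered at top: ", r_squared(actual, pred_high), "\n")
```

A negative value therefore does not indicate a computational error; it indicates that the residual sum of squares exceeds the total sum of squares around the mean.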

summary(predictions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   63.34   81.34   84.65   83.78   87.21   95.06
plot(test_normalized_encod$occupancy_rate_30, predictions,
     xlab = "Actual Values",
     ylab = "Predicted Values",
     main = "SVM-Occupancy: Predictions vs Actuals")

Inspecting the distribution of the predictions, it appears that they are concentrated at the high end of the range: the 1st quartile is 81.34, which is close to the maximum value of 95.06.

residuals <- test_normalized_encod$occupancy_rate_30 - predictions
plot(residuals, main = "Residuals", ylab = "Residual Value")
abline(h = 0, col = "red")

6. Use of Generative AI (notes from everyone to collect and formulate)

  • how you used generative AI in drafting the group work (code-related questions, generate text, explain concepts…)
  • what was easy/hard/impossible to do with generative AI
  • what you had to pay attention to/be critical about when using the results obtained through the use of generative AI

7. Conclusions (together at the end)

  • Summary of key insights
  • Predictive Model Performance (can we answer our research questions?)
  • Implications
  • Limitations
  • Future work

8. References

  • website
  • literature